68 Commits

Author SHA1 Message Date
claude-noether 8f9487d355 ffmpeg-v4l2-request-fourier: route H.264 chroma intra deblock (4:2:0) through daedalus-fourier (0013)
Substitutes c->v_loop_filter_chroma_intra and c->h_loop_filter_chroma_intra
with daedalus wrappers in the bit_depth=8 / chroma_format_idc<=1 (4:2:0)
branch.  4:2:2 stays on the in-tree NEON path (the daedalus chroma intra
dispatch is 4:2:0-only).

The fourier dispatches were exposed in PR #11 (DEFINE_INTRA_DISPATCH
macro generates the public daedalus_dispatch_h264_deblock_chroma_*_intra
symbols + recipe wrappers).

Re-architects the chroma init: v_loop_filter_chroma_intra was previously
assigned unconditionally to the NEON variant (which works for both 4:2:0
and 4:2:2).  We now assign it INSIDE both branches of the chroma_format_idc
conditional — 4:2:0 picks daedalus, 4:2:2 keeps NEON.  No regression for
4:2:2 streams.
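
The re-architected init can be pictured with a toy model. This is only a
sketch: ToyH264DSP, the toy_* stubs and toy_chroma_intra_init are
hypothetical stand-ins, not the real H264DSPContext, NEON kernels or
daedalus wrappers. It shows the intra slots being assigned inside both
branches of the chroma_format_idc test rather than unconditionally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same argument shape as FFmpeg's intra loop filters: no tc0 argument. */
typedef void (*toy_intra_fn)(uint8_t *pix, ptrdiff_t stride, int alpha, int beta);

/* Toy stand-in for the two chroma intra slots (hypothetical type). */
typedef struct ToyH264DSP {
    toy_intra_fn v_loop_filter_chroma_intra;
    toy_intra_fn h_loop_filter_chroma_intra;
} ToyH264DSP;

/* Empty stubs; only their addresses matter in this sketch. */
#define TOY_STUB(name)                                                  \
    static void name(uint8_t *pix, ptrdiff_t stride, int alpha, int beta) \
    {                                                                   \
        (void)pix; (void)stride; (void)alpha; (void)beta;               \
    }

TOY_STUB(toy_neon_v_intra)
TOY_STUB(toy_neon_h_intra)
TOY_STUB(toy_daedalus_v_intra)
TOY_STUB(toy_daedalus_h_intra)

static void toy_chroma_intra_init(ToyH264DSP *c, int bit_depth, int chroma_format_idc)
{
    assert(c != NULL);
    if (bit_depth != 8)
        return;                        /* only the 8-bit path is modelled */

    if (chroma_format_idc <= 1) {      /* 4:2:0: route through the wrappers */
        c->v_loop_filter_chroma_intra = toy_daedalus_v_intra;
        c->h_loop_filter_chroma_intra = toy_daedalus_h_intra;
    } else {                           /* 4:2:2: keep the NEON stubs */
        c->v_loop_filter_chroma_intra = toy_neon_v_intra;
        c->h_loop_filter_chroma_intra = toy_neon_h_intra;
    }
}
```

Assigning in both branches means neither slot can be left pointing at a
4:2:0-only routine when a 4:2:2 stream is initialised.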

Same NEON-to-NEON-via-recipe shape as 0010 (luma intra).

Closes the deblock substitution layer for the 4:2:0 / 8-bit hot path:
- 0005 luma_v non-intra ✓
- 0008 luma_h non-intra ✓
- 0009 chroma_v / chroma_h non-intra ✓
- 0010 luma_v / luma_h intra ✓
- 0013 chroma_v / chroma_h intra ✓

All 8 deblock variants for the common 4:2:0 path now route through
daedalus.  4:2:2 chroma + the chroma422 mbaff variants stay on in-tree
NEON.

Verified the patch applies cleanly on top of 0001-0012 against the
pinned upstream commit b57fbbe5 on hertz.
2026-05-25 14:22:02 +02:00
marfrit f07824adb7 Merge pull request 'ffmpeg-v4l2-request-fourier: route remaining H.264 qpel 8x8 positions through daedalus-fourier (0012)' (#101) from claude-noether/marfrit-packages:noether/h264-substitute-qpel-rest into main
Reviewed-on: marfrit/marfrit-packages#101
2026-05-25 12:19:55 +00:00
claude-noether 2732a022f8 ffmpeg-v4l2-request-fourier: route remaining H.264 qpel 8x8 positions through daedalus-fourier (0012)
Closes the H.264 qpel substitution.  Extends 0007 (which routed only
mc20 put_) to ALL 15 useful positions in BOTH the put_ and avg_
tables, skipping mc00 (integer copy / pointer-only fast path).

29 substitutions total: 14 new put_ + 15 avg_.  Each wraps a single
daedalus_recipe_dispatch_h264_qpel_{avg_,}mcXY call (the dispatches
landed in daedalus-fourier PRs #15-#20).  Collapsed via a single
DEFINE_QPEL_WRAPPER macro on the libavcodec shim side so the diff
is uniform.
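
A rough sketch of how one wrapper macro can stamp out every position.
All names here (DEFINE_TOY_QPEL_WRAPPER, toy_dispatch, the toy_* tables)
are hypothetical, and the real daedalus dispatch signature is not shown
in this commit. The index x + 4*y is the layout FFmpeg's qpel tables use,
which is why mc20 sits at index 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as FFmpeg's qpel_mc_func: (dst, src, stride). */
typedef void (*toy_qpel_fn)(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);

static int toy_dispatch_calls; /* incremented by the stub dispatch below */

/* Hypothetical stand-in for one recipe dispatch per position. */
static void toy_dispatch(int is_avg, int x, int y,
                         uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    (void)is_avg; (void)x; (void)y; (void)dst; (void)src; (void)stride;
    toy_dispatch_calls++;
}

/* One wrapper per (op, x, y); bodies differ only in the names pasted in. */
#define DEFINE_TOY_QPEL_WRAPPER(OP, IS_AVG, X, Y)                           \
    static void toy_##OP##_qpel8_mc##X##Y(uint8_t *dst, const uint8_t *src, \
                                          ptrdiff_t stride)                 \
    {                                                                       \
        toy_dispatch(IS_AVG, X, Y, dst, src, stride);                       \
    }

/* Store a wrapper at table index x + 4*y. */
#define INSTALL_TOY_QPEL_WRAPPER(OP, IS_AVG, X, Y) \
    tab[(X) + 4 * (Y)] = toy_##OP##_qpel8_mc##X##Y;

/* The 15 fractional positions; mc00 (integer copy) is deliberately absent. */
#define TOY_QPEL_POSITIONS(M, OP, IS_AVG)                                 \
                        M(OP, IS_AVG, 1, 0) M(OP, IS_AVG, 2, 0) M(OP, IS_AVG, 3, 0) \
    M(OP, IS_AVG, 0, 1) M(OP, IS_AVG, 1, 1) M(OP, IS_AVG, 2, 1) M(OP, IS_AVG, 3, 1) \
    M(OP, IS_AVG, 0, 2) M(OP, IS_AVG, 1, 2) M(OP, IS_AVG, 2, 2) M(OP, IS_AVG, 3, 2) \
    M(OP, IS_AVG, 0, 3) M(OP, IS_AVG, 1, 3) M(OP, IS_AVG, 2, 3) M(OP, IS_AVG, 3, 3)

TOY_QPEL_POSITIONS(DEFINE_TOY_QPEL_WRAPPER, put, 0)
TOY_QPEL_POSITIONS(DEFINE_TOY_QPEL_WRAPPER, avg, 1)

static void toy_install_put(toy_qpel_fn tab[16])
{
    assert(tab != NULL);
    TOY_QPEL_POSITIONS(INSTALL_TOY_QPEL_WRAPPER, put, 0)
}

static void toy_install_avg(toy_qpel_fn tab[16])
{
    assert(tab != NULL);
    TOY_QPEL_POSITIONS(INSTALL_TOY_QPEL_WRAPPER, avg, 1)
}
```

The sketch generates 15 put_ and 15 avg_ wrappers; the commit counts 29
because the put_ mc20 wrapper already existed from 0007.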

All recipe-table entries route AUTO to CPU NEON — no QPU shaders
for any qpel position other than mc20 yet.  Plumbing-only
NEON-to-NEON via the daedalus recipe layer; bit-exact against
the in-tree ff_*_h264_qpel8_*_neon path (each daedalus dispatch is
already bit-exact-gated by the corresponding fourier PR's test).

16x16 qpel tables ([0][...]) stay on the in-tree NEON.  daedalus
only exposes 8x8 today; 16x16 substitution can land once fourier
provides those variants.

Verified the patch applies cleanly on top of 0001-0011 against the
pinned upstream commit b57fbbe5 on hertz.
2026-05-25 14:05:56 +02:00
marfrit 57f73f1afb Merge pull request 'ffmpeg-v4l2-request-fourier: route H.264 chroma DC Hadamard through daedalus-fourier (0011)' (#100) from claude-noether/marfrit-packages:noether/h264-substitute-chroma-dc into main
Reviewed-on: marfrit/marfrit-packages#100
2026-05-25 12:03:08 +00:00
claude-noether d8aa3aae8d ffmpeg-v4l2-request-fourier: route H.264 chroma DC Hadamard through daedalus-fourier (0011)
Substitutes H264DSPContext.chroma_dc_dequant_idct in the 4:2:0 /
bit_depth=8 init path with a wrapper that composes the daedalus
chroma DC Hadamard primitive (daedalus-fourier PR #25) with the
qmul scaling FFmpeg's reference does in one fused function
(h264idct_template.c::ff_h264_chroma_dc_dequant_idct).

Algorithm per H.264 §8.5.11.1 / §8.5.11.2:
  1. Extract 4 DCs from the scattered positions in the per-MB
     coefficient buffer (stride=32, xStride=16)
  2. 2x2 Hadamard transform (daedalus primitive)
  3. qmul scale + >> 7, write back to original positions

Bit-exact against ff_h264_chroma_dc_dequant_idct_8_c. The Hadamard
itself is gated by the fourier PR #23 7-case test suite (including
the H·H = 4·I algebraic invariant), and the public-API parity
test added in PR #25 confirms the src/ symbol matches the test ref.

4:2:2 chroma stays on the in-tree ff_h264_chroma422_dc_dequant_idct_c
path — same chroma_format_idc<=1 gating shape as 0009 chroma deblock.

Pin bump: _daedalus_fourier_commit / DAEDALUS_FOURIER_COMMIT bumped
to b9f9ff2a (post-PR #25) so the build picks up the public
daedalus_h264_chroma_dc_hadamard_2x2 symbol.

Verified the patch applies cleanly on top of 0001-0010 against the
pinned upstream commit b57fbbe5 on hertz.
2026-05-25 13:39:54 +02:00
marfrit 1f58ff2b6b Merge pull request 'ffmpeg-v4l2-request-fourier: route H.264 luma intra deblock through daedalus-fourier (0010)' (#99) from claude-noether/marfrit-packages:noether/h264-substitute-deblock-intra into main
Reviewed-on: marfrit/marfrit-packages#99
2026-05-25 11:28:33 +00:00
claude-noether 45be17fbdf ffmpeg-v4l2-request-fourier: route H.264 luma intra deblock through daedalus-fourier (0010)
Adds the bS=4 intra-strength variants of the already-substituted
luma_v / luma_h deblock (0005, 0008).  Intra MBs and certain
inter-MB edges (4x4 transform boundaries inside an Intra_NxN
neighbour) force boundary strength to 4 per H.264 §8.7.2.1.

  H264DSPContext.v_loop_filter_luma_intra →
    daedalus_recipe_dispatch_h264_deblock_luma_v_intra
  H264DSPContext.h_loop_filter_luma_intra →
    daedalus_recipe_dispatch_h264_deblock_luma_h_intra

Both kernels landed in daedalus-fourier PR #11.  Recipe → CPU NEON
(no intra QPU shaders yet); plumbing-only NEON-to-NEON via daedalus.

Signature differs from bS<4: no tc0 argument.  Wrapper passes
daedalus_h264_deblock_meta with alpha/beta set; tc0[] is ignored by
the intra dispatch (bS=4 hardcodes the strength).

Chroma intra variants are deferred to a follow-up because the chroma
init has a 4:2:0 / 4:2:2 split (chroma_format_idc gating) — the
daedalus dispatch is 4:2:0-only and needs explicit conditional
substitution to avoid running on 4:2:2 chroma.

Verified the patch applies cleanly on top of 0001-0009 against the
pinned upstream commit b57fbbe5 on hertz.
2026-05-25 13:21:00 +02:00
marfrit 7b9bb9b2d0 Merge pull request 'ffmpeg-v4l2-request-fourier: route H.264 chroma v/h deblock through daedalus-fourier (0009)' (#98) from claude-noether/marfrit-packages:noether/h264-substitute-deblock-chroma into main
Reviewed-on: marfrit/marfrit-packages#98
2026-05-25 11:18:15 +00:00
claude-noether babb280410 ffmpeg-v4l2-request-fourier: route H.264 chroma v/h deblock through daedalus-fourier (0009)
Chroma siblings of 0005 (luma_v) and 0008 (luma_h).  Same
NEON-to-NEON pattern via the daedalus recipe layer:

  H264DSPContext.v_loop_filter_chroma →
    daedalus_recipe_dispatch_h264_deblock_chroma_v
  H264DSPContext.h_loop_filter_chroma →
    daedalus_recipe_dispatch_h264_deblock_chroma_h

Both kernels landed in daedalus-fourier PR #10.  Recipe table routes
AUTO to CPU NEON (no chroma QPU shaders yet), so this is plumbing-
only and stays bit-exact against the in-tree NEON.

Intra chroma (bS=4) loop filters remain on in-tree NEON;
daedalus_h264_deblock_meta covers the non-intra (bS<4) path.

Verified the patch applies cleanly on top of 0001-0008 against the
pinned upstream commit b57fbbe5 on hertz.  Wires the new patch into
both arch/PKGBUILD and debian/build-deb.sh.
2026-05-25 13:16:45 +02:00
marfrit 5b48d1c743 Merge pull request 'ffmpeg-v4l2-request-fourier: route H.264 luma-h deblock through daedalus-fourier (0008)' (#97) from claude-noether/marfrit-packages:noether/h264-substitute-deblock-luma-h into main
Reviewed-on: marfrit/marfrit-packages#97
2026-05-25 11:14:26 +00:00
claude-noether 624f83e877 ffmpeg-v4l2-request-fourier: route H.264 luma-h deblock through daedalus-fourier (0008)
Adds patch 0008 to the substitution arc, mirroring 0005's V variant
for H.264 non-intra bS<4 horizontal luma deblock.

  H264DSPContext.h_loop_filter_luma →
    daedalus_recipe_dispatch_h264_deblock_luma_h

The H kernel was added to daedalus-fourier in PR #9 (vendored
ff_h264_h_loop_filter_luma_neon, wired through the same CPU-dispatch
pattern as V).  Recipe table routes AUTO to CPU NEON (no QPU shader
for H yet), so this is a NEON-to-NEON substitution via the daedalus
recipe layer — same shape as 0005.

The libavcodec.so ctx remains no-QPU (daedalus_ctx_create_no_qpu),
matching the existing 0003/0004/0005/0007 patches.  Higher-cycle
QPU init waits for a feature-flag gating change in a separate PR.

Intra (bS=4) h_loop_filter_luma_intra stays on the in-tree NEON .S
code; daedalus_h264_deblock_meta covers the non-intra path only.
A follow-up can route intra once daedalus-fourier exposes the
intra-h dispatch (the kernel already exists internally per fourier
PR #11).

Wires the new patch into both arch/PKGBUILD and debian/build-deb.sh
sequences.  Verified the patch applies cleanly on top of 0001-0007
against the pinned upstream commit b57fbbe5 on hertz.
2026-05-25 13:10:05 +02:00
marfrit 902de73a02 Merge pull request 'mesa-panvk-bifrost r7: fix XFB store channel-extract for packed varyings (iter19)' (#96) from claude-noether/marfrit-packages:mesa-panvk-bifrost-r7 into main
Reviewed-on: marfrit/marfrit-packages#96
2026-05-25 09:22:47 +00:00
marfrit c14c22f942 mesa-panvk-bifrost r7: fix XFB store channel-extract for packed varyings (iter19)
Adds 0007-panvk-bifrost-xfb-component-base-fix.patch — eliminates a
reliable SIGSEGV in vkCreateGraphicsPipeline whenever an XFB-bound
vertex output is declared with non-zero `layout (component=N)`.

Surfaced by dEQP-VK.transform_feedback.simple.holes_vert (Mali-G52 r1
MC1, PAN_ARCH 7). Backtrace lands 11 frames into libvulkan_panfrost.so
called from vkt::TransformFeedback::TransformFeedbackHolesInstance::
iterate.

Root cause: iter17's lower_xfb_output_iter17 (and upstream
pan_nir_lower_xfb, which has the identical `// TODO`) computes the
source-channel mask as `mask << channel_idx`, where channel_idx is
the varying-location component (0..3) but src only contains channels
starting at nir_intrinsic_component(intr). For a scalar declared
component=2, the lowering computed `mask << 2` against a 1-component
src — out-of-range; the malformed nir_def segfaulted in downstream
NIR constant-folding (nir_constant_expressions.c::evaluate_*).

The fix translates channel_idx into source-channel space by subtracting
nir_intrinsic_component(intr) before shifting the mask. It also replaces
the asserts, which are compiled out in release builds, with explicit
guards that stay active in release mode (closing the same elision class
as the original bug).
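
A toy model of the index translation, under loud assumptions: the
function names, the 4-channel limit and the "return 0 on out-of-range"
convention are all hypothetical and are not taken from the Mesa patch.
It only shows why shifting by the location component overshoots a
1-component source, and how subtracting the component base plus a plain
`if` guard avoids that even when asserts are compiled out.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-fix behaviour in this toy model: shift by the location component. */
static uint32_t toy_src_mask_buggy(uint32_t mask, unsigned channel_idx)
{
    return mask << channel_idx;
}

/*
 * Post-fix behaviour in this toy model: shift by the channel index
 * relative to the first channel that src actually holds.
 * Returns 0 when the request falls outside src.
 */
static uint32_t toy_src_mask_fixed(uint32_t mask, unsigned channel_idx,
                                   unsigned component_base,
                                   unsigned src_num_components)
{
    /* Debug-only sanity check; disappears under NDEBUG. */
    assert(src_num_components >= 1 && src_num_components <= 4);

    /* Guard 1 (always active): the channel must not precede src's first. */
    if (channel_idx < component_base)
        return 0;

    const unsigned src_channel = channel_idx - component_base;
    const uint32_t shifted = mask << src_channel;
    const uint32_t valid = (1u << src_num_components) - 1u;

    /* Guard 2 (always active): every selected bit must exist in src. */
    if (shifted & ~valid)
        return 0;

    return shifted;
}
```

For a scalar declared with component=2, the buggy form yields bit 2
against a source that only has bit 0, while the fixed form yields bit 0.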

Verified on PineTab2 (Mali-G52 r1 MC1, PAN_ARCH 7) against vulkan-cts
1.3.10.0:
  - holes_vert / holes_extra_draw_vert no longer SIGSEGV (now Fail
    on color-check; that is a separate iter20 finding).
  - basic_*: 36/36 Pass. depth_clip_*: 1 Pass + 4 NotSupported.
    lines_or_triangles*: 16 NotSupported. 0 Fail across the set.

Caveat (not regressions): max_output_components_64/_128/_256 were
never reached on the r5 sweep — watchdog killed transform_feedback
after the holes_vert crash. With this fix in place, they now run
and surface their own pre-existing coredumps, confirmed on shipped
r6 baseline too. iter20+ territory.

Phase 5 (2nd-model) review: APPROVE WITH CHANGES (non-blocking).
Changes applied: release-mode defensive guards on both preconditions
plus a dispatcher-side comment clarifying the i*2+j semantics.

Cross-refs:
  - ~/src/panvk-bifrost/iter19/phase{0,1,2,3}_holes_vert*.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:16:56 +02:00
marfrit b113e053f0 Merge pull request 'aish: package v0.1.0 for arch + debian' (#95) from claude-noether/marfrit-packages:noether/aish-v0.1.0-package into main
Reviewed-on: marfrit/marfrit-packages#95
2026-05-24 22:40:53 +00:00
marfrit 8ec4c57ad7 aish: package v0.1.0 for arch + debian
aish is an AI-augmented conversational shell in LuaJIT 2.x with FFI
bindings to libcurl, GNU readline, and libc — no C extensions, no
build step. Source-of-truth: git.reauktion.de/marfrit/aish, tag v0.1.0
(tarball sha256 9ebc3939e028832e39391ae33efacb5ec9bcd99d123cbc8ca1cd6ca9a640b5b5).

The arch and debian recipes mirror the lmcp pattern (pure-Lua any-arch
package, no makefile, install copies modules directly):

  arch/aish/PKGBUILD           — depends=(luajit readline curl)
  debian/aish/build-deb.sh     — pure dpkg-deb, SOURCE_DATE_EPOCH pinned
  debian/aish/debian/{control,changelog,copyright}

Install layout, matching what main.lua's script-dir-relative package.path
expects after the wrapper execs `luajit /usr/share/lua/5.1/aish/main.lua`:

  /usr/bin/aish                              ← bin/aish wrapper
  /usr/share/lua/5.1/aish/{main,broker,context,executor,history,
                           mcp,renderer,repl,router,safety,secrets}.lua
  /usr/share/lua/5.1/aish/ffi/{curl,libc,pty,readline}.lua
  /usr/share/lua/5.1/aish/vendor/dkjson.lua
  /usr/share/doc/aish/{README.md,LICENSE,examples/config.lua}

CI: two new jobs in .gitea/workflows/build.yml at the end of file.
aish-any chains needs:lmcp-debian (parallel-DAG with claude-his-any,
serialized via the shared arch-aarch64 runner — avoids needless wait
through the unrelated fourier stack). aish-debian chains needs:aish-any.
Both invoke the standard check-already-published.sh fast-skip on no-
change pushes.

Sonnet review (per feedback_reviews_use_sonnet.md + bugfix-process
step 4): no blockers. Folded in two findings before commit: switched
needs: from mpv-fourier-aarch64 to lmcp-debian (cleaner DAG, faster
cold-build wall clock), removed the dead Build-Depends: debhelper-
compat line from debian/aish/debian/control (build-deb.sh doesn't
use debhelper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 00:35:01 +02:00
marfrit beb09c4863 Merge pull request 'mesa-panvk-bifrost: r5 -> r6 — advertise VK_EXT_legacy_dithering on Bifrost' (#94) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-r6-legacy-dithering into main
Reviewed-on: marfrit/marfrit-packages#94
2026-05-24 22:28:58 +00:00
marfrit 7708744eb2 mesa-panvk-bifrost: r5 -> r6 — advertise VK_EXT_legacy_dithering on Bifrost
Backports Mesa main's unconditional flip of VK_EXT_legacy_dithering.
Pure-software composition; no new HW path. vk_render_pass already gates
on enabled_features.legacyDithering and panvk_vX_blend + pan_format
already plumb the dithered BLEND descriptor (BFMT2 table has MALI_BLEND_AU
encodings for RGB565/RGB5A1/RGBA4/RGB10A2 on PAN_ARCH 7). Our r5 base
just hadn't picked up the cherry-pick.

Phase 7 verify on ohm (PineTab2 / RK3566 / Mali-G52 r1 MC1) with a
locally-built r6 lib at /tmp/r6_test_lib/:

  Feature advertisement:
    r5: VK_EXT_legacy_dithering not in extension list, legacyDithering=false
    r6: VK_EXT_legacy_dithering rev 2 advertised, legacyDithering=true

  dEQP subsets (delta):
    dEQP-VK.api.*.legacy_dithering*    r5/r6 both:  2 P / 0 F (identical)
    dEQP-VK.renderpass.dithering.*     r5/r6 both:  0 P / 0 F / 94 NS (identical)
    dEQP-VK.renderpass2.dithering.*    r5/r6 both:  0 P / 0 F / 94 NS (identical)

  Net: zero regressions, advertisement-only delta as expected.

Second-model review (per bugfix-process step 4) traced the full code
path through vk_render_pass + panvk_vX_cmd_draw + panvk_vX_blend +
pan_format BFMT2. No interaction with our r1 nullDescriptor (disjoint
paths). Mesa upstream marks ext DONE for panvk in docs/features.txt.

ARM's own libmali r51p0 driver (BXODROIDN2PL, 2024-08) lists
VK_EXT_legacy_dithering in its Vulkan extension string table,
confirming the feature is shipped by ARM for Mali-G52-class hardware.

Follow-up out of scope: the 94 renderpass-dithering tests show as
NotSupported on both r5 and r6 — there's a separate panvk-side
prereq the dEQP harness checks (likely a specific format-feature
combination). Worth investigating in a future iteration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 00:23:40 +02:00
marfrit 5d97cf15d6 Merge pull request 'chromium-fourier: bump to 148.0.7778.178, add Debian recipe' (#93) from claude-noether/marfrit-packages:noether/chromium-fourier-148 into main
Reviewed-on: marfrit/marfrit-packages#93
2026-05-24 19:46:13 +00:00
marfrit 58f67d4b2c chromium-fourier: bump to 148.0.7778.178, add Debian recipe
Arch: version 147→148, drop enable_nacl (removed upstream), fix
nv12-external-oes patch context for 148 (base/numerics/safe_conversions.h
include removed upstream). Header comment updated: native build fiction →
cross-compile reality.

Debian: new build-deb.sh that assembles .deb from pre-built artifacts
on CT 220 (data). Same binary artifacts as the Arch package, launcher at
/usr/bin/chromium-fourier (no Conflicts with stock chromium on Debian).

Both packages published to packages.reauktion.de:
- Arch: marfrit/aarch64/chromium-fourier 1:148.0.7778.178-1
- Debian: trixie+bookworm/main/arm64 chromium-fourier 1:148.0.7778.178-1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-24 09:45:30 +02:00
marfrit 685f85c22e Merge pull request 'mesa-panvk-bifrost: r4 -> r5 — advertise fragmentStoresAndAtomics on Bifrost (closes panvk-bifrost#2)' (#92) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-r5 into main
Reviewed-on: marfrit/marfrit-packages#92
2026-05-23 12:19:10 +00:00
marfrit 6896853544 mesa-panvk-bifrost: r4 -> r5 — advertise fragmentStoresAndAtomics on Bifrost
Backports Mesa main's unconditional flip of .fragmentStoresAndAtomics
to true in src/panfrost/vulkan/panvk_vX_physical_device.c. Closes
the Dawn-WebGPU adapter rejection at PhysicalDeviceVk.cpp:250 that
caused brave-vulkan to fall back to the SwiftShader CPU adapter on
PineTab2/Mali-G52, per marfrit/panvk-bifrost#2.

Phase 7 verify on ohm (PineTab2, RK3566, Mali-G52 r1 MC1) with a
locally-built r5 lib installed to /tmp/r5_test_lib/:

  dEQP-VK.glsl.atomic_operations.*:
    r4:  48 pass /   0 fail / 992 NotSupported  (1040 total)
    r5:  80 pass /   0 fail / 960 NotSupported  (1040 total)
    delta: +32 newly-passing, zero new failures

  dEQP-VK.image.store.*:
    r4: 2772 pass /  0 fail / 238 NotSupported  (3010 total)
    r5: 2772 pass /  0 fail / 238 NotSupported  (3010 total)
    delta: identical (image.store is independent of the flag)

The disjunction with instance->force_enable_shader_atomics is kept as
a documented kill-switch even though the compiler folds it away —
it leaves the DRI option pan_force_enable_shader_atomics semantically
wired for future rebases or downstream debugging.

Patch reviewed via 2nd-model pass (per bugfix-process step 4):
recommended keeping the disjunction (applied), Bifrost-only-vs-unconditional
left unconditional to match upstream (applied), pre-ship CTS subset
(applied with results above).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 14:04:44 +02:00
marfrit fd56eca3cb Merge pull request 'mesa-panvk-bifrost{,-video}: fix url= to real Gitea repo' (#91) from claude-noether/marfrit-packages:noether/panvk-bifrost-url-fix into main
Reviewed-on: marfrit/marfrit-packages#91
2026-05-23 03:18:37 +00:00
marfrit 91022b390e mesa-panvk-bifrost{,-video}: fix url= to real Gitea repo
Both PKGBUILDs referenced url=https://github.com/marfrit/panvk-bifrost,
which was a hallucinated URL — no such repo existed. The campaign's
real source-of-truth home was just created at
https://git.reauktion.de/marfrit/panvk-bifrost (mfritsche, 2026-05-23).

Point both PKGBUILDs at the real URL so `pacman -Si` and any consumer
reading package metadata follows a working link.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:17:41 +02:00
marfrit b736dd0529 Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier' (#90) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-qpel-mc20-daedalus into main
Reviewed-on: marfrit/marfrit-packages#90
2026-05-23 01:34:04 +00:00
claude-noether 0bfc4ab03e ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" — the canonical representative of the H.264
luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.

Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence:

  cycle 6 (PR #76)  H.264 IDCT 4x4         done
  cycle 7 (PR #85)  H.264 IDCT 8x8         done
  cycle 8 (PR #86)  H.264 luma-v deblock   done
  cycle 9 (this)    H.264 qpel mc20

Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public API
gains daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns at
131 Mblock/s gives 135× margin over 30 fps 1080p; QPU dispatch floor
at ~250 ns makes any V3D shader strictly worse.  Substitution is
plumbing-only — same daedalus_ctx_create_no_qpu pthread_once shape
the cycles 6/7/8 shims already own (kept SEPARATE from the H264DSP
shim's ctx because H264QPEL is its own libavcodec Makefile module
and link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).
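
One reading that reproduces the quoted margin, offered as an assumption
rather than the documented derivation: treat a 1920x1080 frame as
1920*1080/64 8x8 luma blocks with one mc20 call each, at 30 fps. The
helper names below are hypothetical.

```c
#include <assert.h>

/* Blocks per second needed for a given frame size, block edge and rate. */
static double toy_required_blocks_per_s(int width, int height, int block, double fps)
{
    assert(width > 0 && height > 0 && block > 0 && fps > 0.0);
    return (double)width * (double)height / ((double)block * block) * fps;
}

/* How many times faster than required a kernel with this latency runs. */
static double toy_margin(double ns_per_block, double required_blocks_per_s)
{
    assert(ns_per_block > 0.0 && required_blocks_per_s > 0.0);
    const double achieved_blocks_per_s = 1e9 / ns_per_block;
    return achieved_blocks_per_s / required_blocks_per_s;
}
```

With these inputs the requirement is 972,000 blocks/s, 7.6 ns per block
is about 131.6 Mblock/s, and the ratio is about 135. The same arithmetic
puts a ~250 ns dispatch floor at roughly 33 times the NEON per-block
cost.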

Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code per the cycle-9 phase-1
rationale (mc20 8x8 is representative; remaining variants would
multiply recipe-lookup overhead without changing the substrate
verdict).

Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks bit-exact, M3 = 131 Mblock/s).

No SONAME change, no Depends change.  PKGREL 9 → 10.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
2026-05-23 03:32:29 +02:00
marfrit 8729c2db92 Merge pull request 'daedalus-v4l2 + daedalus-v4l2-dkms: bump to 872eec5 — PROTO_MAX_PAYLOAD 1 MiB (#20)' (#89) from claude-noether/marfrit-packages:noether/daedalus-bump-872eec5-1mib-payload into main
Reviewed-on: marfrit/marfrit-packages#89
2026-05-22 18:52:53 +00:00
marfrit d449ec1073 daedalus-v4l2 + daedalus-v4l2-dkms: bump to 872eec5 — PROTO_MAX_PAYLOAD 1 MiB (#20)
Picks up reauktion/daedalus-v4l2 PR #20 (closes #19): wire-protocol
cap DAEDALUS_PROTO_MAX_PAYLOAD raised from 64 KiB to 1 MiB.
DAEDALUS_MAX_BITSTREAM follows; daedalus_fill_output_fmt now reports
OUTPUT_MPLANE sizeimage = ~1 MiB.

Fixes the Firefox YouTube avc1 SW-fallback observed on higgs when
any H.264 slice exceeded 64 KiB (routine on 720p+ streams).
libva-v4l2-request-fourier's S_FMT-driven OUTPUT-pool resize was
clamping back to 65484 and Firefox lost the slice; now the kernel
honours the larger sizeimage.

Both packages bumped to 0.1.0+r45+g872eec5-1:

  - daedalus-v4l2 (daemon): r43 -> r45.  Daemon-side allocations
    are dynamic, so the only growth is one ~1 MiB read buffer per
    daemon process at startup.
  - daedalus-v4l2-dkms (kernel module): r33 -> r45.  Skips the
    daemon-only bumps r37/r39/r41/r43 (no kernel/include change in
    that range) and lands the PROTO_MAX_PAYLOAD bump.

LOCK-STEP INSTALL REQUIRED: effective cap is min(kernel, daemon).
A stale kernel with a new daemon (or vice versa) still rejects
>64 KiB payloads.  apt/pacman should pick both up in one
transaction since they share the same upstream pin.
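
The lock-step rule reduces to a min(). A minimal sketch, with
hypothetical names and with the caps passed in explicitly (the real code
takes them from the shared protocol header):

```c
#include <assert.h>
#include <stddef.h>

enum {
    TOY_CAP_OLD = 64 * 1024,     /* pre-bump cap, 64 KiB */
    TOY_CAP_NEW = 1024 * 1024    /* post-bump cap, 1 MiB */
};

/* The smaller of the two sides decides what actually gets through. */
static size_t toy_effective_cap(size_t kernel_cap, size_t daemon_cap)
{
    return kernel_cap < daemon_cap ? kernel_cap : daemon_cap;
}

/* Nonzero when a payload of this size passes both sides. */
static int toy_payload_accepted(size_t payload, size_t kernel_cap, size_t daemon_cap)
{
    assert(kernel_cap > 0 && daemon_cap > 0);
    return payload <= toy_effective_cap(kernel_cap, daemon_cap);
}
```

A 100 KiB slice is accepted only when both arguments are TOY_CAP_NEW;
upgrading one side alone leaves the effective cap at 64 KiB.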

Wire-protocol value-only change in include/daedalus_v4l2_proto.h;
struct layout unchanged.  DAEDALUS_PROTO_VERSION stays at 0.
2026-05-22 20:50:04 +02:00
marfrit 9d30c34be9 Merge pull request 'daedalus-v4l2: 6e6dfa1 -> 1d8f5af — pause-time tiny-bitstream filter (#18)' (#88) from claude-noether/marfrit-packages:noether/daedalus-bump-1d8f5af-pause-filter into main
Reviewed-on: marfrit/marfrit-packages#88
2026-05-22 16:20:14 +00:00
marfrit 1ca18ac130 daedalus-v4l2: 6e6dfa1 -> 1d8f5af — pause-time tiny-bitstream filter (#18)
Picks up reauktion/daedalus-v4l2 PR #18 (closes #17): daemon drops
degenerate (<4 byte) bitstreams at REQ_DECODE entry instead of
letting avcodec_send_packet emit AVERROR_INVALIDDATA, replies
RESP_FRAME NO_FRAME so libva's V4L2 surface pool stays alive.
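
The entry filter amounts to a length check ahead of the decoder. The
sketch below is hypothetical throughout: the toy_reply enum collapses
the RESP_FRAME / NO_FRAME reply into a single value, and toy_decode is a
stub in place of the real avcodec_send_packet path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified reply model; both values stand for a RESP_FRAME reply. */
typedef enum {
    TOY_REPLY_FRAME,     /* a decoded frame is attached */
    TOY_REPLY_NO_FRAME   /* nothing decoded, but not an error */
} toy_reply;

/* Bitstreams shorter than this are treated as degenerate. */
#define TOY_MIN_BITSTREAM ((size_t)4)

/* Stub decoder; the real path would hand the packet to libavcodec. */
static toy_reply toy_decode(const uint8_t *buf, size_t len)
{
    (void)buf; (void)len;
    return TOY_REPLY_FRAME;
}

/* Drop degenerate input before it can surface as a decode failure. */
static toy_reply toy_handle_req_decode(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < TOY_MIN_BITSTREAM)
        return TOY_REPLY_NO_FRAME;   /* no-op reply keeps the session alive */
    return toy_decode(buf, len);
}
```

Answering with a no-frame reply instead of an error is the point: the
client sees an empty result rather than a failed decode.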

Fixes the Firefox YouTube avc1 pause→resume regression observed on
higgs: libva-v4l2-request-fourier flushes a 3-byte stub into
OUTPUT_MPLANE at the pause boundary; the old daemon path turned
that into a decode failure, Firefox marked H.264-via-VAAPI as
broken for the session, and routed every subsequent frame to
libmozavcodec SW.  After this bump the daemon logs 'tiny bitstream
3 bytes — dropping as no-op' and the next real REQ_DECODE
proceeds normally.

Wire protocol unchanged.  daedalus-v4l2-dkms bump not needed.
2026-05-22 18:16:33 +02:00
marfrit cf9eef6cfa Merge pull request 'ffmpeg-v4l2-request-fourier: restore AV_CODEC_FLAG_LOW_DELAY in H.264 decoder' (#87) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-restore-low-delay into main
Reviewed-on: marfrit/marfrit-packages#87
2026-05-22 14:27:43 +00:00
marfrit 5c69460722 ffmpeg-v4l2-request-fourier: restore AV_CODEC_FLAG_LOW_DELAY in H.264 decoder
FFmpeg 8.x dropped the H.264 decoder's low_delay code path —
AV_CODEC_FLAG_LOW_DELAY no longer prevents h264_select_output_frame
from running the display-order DPB output queue.  The daedalus-v4l2
daemon's `ctx->flags |= AV_CODEC_FLAG_LOW_DELAY` at
daemon/src/decoder.c:202 has been a silent no-op since the SONAME
61→62 jump landed in reauktion/daedalus-v4l2 PR #16; on Firefox
YouTube this re-introduced the 2-1-4-3 B-frame pair-swap that PR
#12's daemon flag was supposed to prevent.

Fix lives in libavcodec, not the daemon: restore the documented
LOW_DELAY semantics so the daemon (and any other V4L2-stateless-
style consumer) keeps the one-frame-per-send_packet decode-order
output contract it already declares.

## Patch

0006-h264-restore-low-delay.patch touches libavcodec/h264_slice.c:

- h264_select_output_frame: early-exit when LOW_DELAY is set.
  Emit the just-decoded picture as next_output_pic, mirror the
  corruption / recovery-point tracking the main path performs,
  skip delayed_pic[] / POC reorder machinery entirely.

- h264_field_start: suppress the SPS-driven
  `has_b_frames = sps->num_reorder_frames` clobber when LOW_DELAY
  is set.  Without this the per-slice bitstream_restriction_flag
  re-pickup would reintroduce a nonzero reorder buffer mid-stream
  even after the daemon set has_b_frames=0 at avcodec_open2.

## Why not daemon-side

A daemon SPS-rewrite (`num_reorder_frames=0`) was considered but
rejected: it works only for the daemon's reconstructed SPS NAL,
not for in-band SPS NALs that the daemon parses through a dlopen'd
libavformat in other code paths.  Restoring documented FFmpeg flag
semantics is the smaller, more durable change and keeps the daemon
interface stable.

## Packaging

- PKGREL/pkgrel bump to 9.
- No new build-deps, no Depends change.
- Substitution arc cycles 6/7/8 unchanged.

## Refs

- reauktion/daedalus-v4l2#11 / #12 (LOW_DELAY half-measure on
  daemon side, originally landed against FFmpeg 7.x).
- daemon/src/decoder.c:202 (`ctx->flags |= AV_CODEC_FLAG_LOW_DELAY`
  for H.264 only — unchanged, but now actually has effect again).
2026-05-22 14:20:37 +02:00
marfrit d11a52405d Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier' (#86) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-deblock-luma-v-daedalus into main
Reviewed-on: marfrit/marfrit-packages#86
2026-05-22 10:29:09 +00:00
marfrit 29e0852d11 ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier
Cycle 8 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11
step 2).  H264DSPContext.v_loop_filter_luma — non-intra bS<4 vertical
luma deblock, called per macroblock-row edge from the slice deblock
loop in libavcodec/h264_loopfilter.c — now dispatches through
daedalus_recipe_dispatch_h264_deblock_luma_v instead of
ff_h264_v_loop_filter_luma_neon.

## What

- Add 0005-h264-deblock-luma-v-daedalus-fourier.patch (in both arch/
  and debian/ ffmpeg-v4l2-request-fourier/).  Extends
  libavcodec/aarch64/h264_idct_daedalus.c with
  ff_h264_v_loop_filter_luma_daedalus (constructs a
  daedalus_h264_deblock_meta from FFmpeg's (alpha, beta, tc0[4]) and
  calls daedalus_recipe_dispatch_h264_deblock_luma_v with n_edges=1).
  Patches libavcodec/aarch64/h264dsp_init_aarch64.c to wire
  c->v_loop_filter_luma to the new shim.
- arch/PKGBUILD + debian/build-deb.sh: append patch + bump pkgrel/PKGREL
  to 8.
- No new build-deps, no Depends change, no daedalus-fourier rev — the
  d87239d pin already exposes daedalus_recipe_dispatch_h264_deblock_luma_v.

## Why

Cycle 8 is marked "CPU primary; QPU opportunistic" in the daedalus-
fourier API docstring.  Per the hybrid substrate philosophy
("if there's a coprocessor, use it") we eventually want the QPU
opportunism active here.  But the libavcodec.so context is
process-global and shared with cycles 6/7 via pthread_once, and it
uses daedalus_ctx_create_no_qpu deliberately to avoid implicit
Vulkan init in arbitrary host processes (Firefox content, mpv-fourier,
ffmpeg-fourier CLI, ...).  Switching to daedalus_ctx_create here
without a feature flag would be a footgun.

So cycle 8 lands as plumbing-only NEON-by-recipe substitution for
now; opportunistic QPU enablement is a separate follow-up that adds
a DAEDALUS_FOURIER_ENABLE_QPU env var or equivalent.

## Scope NOT covered

- Intra (bS=4) loop filter c->v_loop_filter_luma_intra — daedalus's
  daedalus_h264_deblock_meta only covers the non-intra path.
- Horizontal-edge variant c->h_loop_filter_luma — separate kernel
  (not yet in daedalus-fourier API).
- Chroma loop filters — separate kernels.
- Bulk batching — single-edge dispatch wastes the kernel's n_edges>1
  amortization.  Same caveat as cycles 6/7; follow-up.
- QPU opportunism — see "Why" above.

## SONAME

Unchanged.  libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.

## Refs

- reauktion/daedalus-v4l2 issue #11: reauktion/daedalus-v4l2#11
- marfrit-packages PR #76 (cycle 6 IDCT 4×4)
- marfrit-packages PR #85 (cycle 7 IDCT 8×8)
- marfrit/daedalus-fourier cycle 8 close (deblock luma-v NEON green)
2026-05-22 12:17:14 +02:00
marfrit 510a31622c Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 8×8 → daedalus-fourier' (#85) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-idct8-daedalus into main
Reviewed-on: marfrit/marfrit-packages#85
2026-05-22 08:32:15 +00:00
marfrit db9ae16da9 Merge pull request 'mesa-panvk-bifrost-video: regenerate 0005 patch from POST-review snapshot' (#84) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-retrigger into main
Reviewed-on: marfrit/marfrit-packages#84
2026-05-22 08:20:34 +00:00
marfrit 493c762967 ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 8×8 → daedalus-fourier
Cycle 7 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11
step 2).  H264DSPContext.idct8_add — called per 8×8 block from the
High-profile intra-8×8-DCT decode path in libavcodec/h264_mb.c — now
dispatches through daedalus_recipe_dispatch_h264_idct8 instead of
ff_h264_idct8_add_neon.

## What

- Add 0004-h264-idct8-daedalus-fourier.patch (in both arch/ and debian/
  ffmpeg-v4l2-request-fourier/).  Extends libavcodec/aarch64/
  h264_idct_daedalus.c (introduced by 0003) with ff_h264_idct8_add_daedalus
  and a daedalus_recipe_dispatch_h264_idct8 call; patches
  libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->idct8_add to
  the new shim.
- arch/PKGBUILD + debian/build-deb.sh: append the new patch to the
  apply list; bump pkgrel/PKGREL to 7.
- No new build-deps, no Depends change, no daedalus-fourier rev — the
  d87239d pin already exposes daedalus_recipe_dispatch_h264_idct8.

## Why

The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8×8)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution layered on top of cycle 6.  Production validation of
cycle 6 on higgs Firefox YouTube: 3040 frames decoded cleanly,
avg_decode_us=3388 (no regression vs the pre-substitution ~4 ms
baseline).  Cycle 7 inherits the same shim's pthread_once context.

Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green; FFmpeg 8×8 block storage block[r + 8*c] matches daedalus
column-major convention).

## Scope NOT covered (deferred)

- Bulk c->idct8_add4 (inter 8×8-DCT macroblocks) stays on the
  in-tree NEON .S code; batched substitution with n_blocks>1 lands
  later alongside the cycle-6 bulk-paths work.
- High-bit-depth (10-bit) path untouched.
- Cycles 8/9 — separate PRs.

## SONAME

Unchanged.  libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.

## Refs

- reauktion/daedalus-v4l2 issue #11 (substitution arc): reauktion/daedalus-v4l2#11
- marfrit-packages PR #76 (cycle 6 IDCT 4×4)
- marfrit-packages PR #78 (libxml2 ABI-skew workaround)
- marfrit/daedalus-fourier cycle 7 close (H.264 IDCT 8×8 NEON green)
2026-05-22 10:20:27 +02:00
marfrit 7ecbcb3c1b mesa-panvk-bifrost-video: regenerate 0005 patch from POST-review snapshot
The original 0005 patch was generated from the pre-Phase-5-review source
snapshot (phase5_review_input_2026-05-21.tgz), missing the four
load-bearing review fixes that landed in the post-review snapshot:
  - probe_hantro gate on KHR_video_* extension advertisement
  - per-session ts_counter (was process-global static)
  - panvk_v4l2_session_finish full unwind (munmap + STREAMOFF + REQBUFS=0)
  - MIN2(rb.count, 18) clamp on num_*_buffers

Run #162 (job 17032) failed in prepare() because the PKGBUILD sanity
check 'grep -q "KHR_video_queue = PAN_ARCH < 9 && panvk_v4l2_probe_hantro()"'
didn't match the actual patched output (which still had the pre-review
'KHR_video_queue = PAN_ARCH < 9,').

This patch (regenerated from phase5_post_review_2026-05-21.tgz) carries
all four review fixes. Validated locally: vanilla mesa-26.0.6 + r1..r4 +
this patch reproduces prepare()-OK byte-for-byte.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 10:18:11 +02:00
marfrit 360e8eb6bf Merge pull request 'mesa-panvk-bifrost-video: r1-r4 patches as real files (symlinks broke CI)' (#83) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-retrigger into main
Reviewed-on: marfrit/marfrit-packages#83
2026-05-22 07:55:59 +00:00
marfrit 4db64917bc mesa-panvk-bifrost-video: r1-r4 patches as real files (symlinks broke CI)
The original PR #79 used symlinks for 0001..0004 patches (pointing into
../mesa-panvk-bifrost/) to avoid drift between siblings. CI's
"cp -r arch/mesa-panvk-bifrost-video /tmp/build-..." preserves the
symlinks, but the destination /tmp/build-... has no sibling dir to
resolve them against, so makepkg errors with:

  ==> ERROR: 0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch
             was not found in the build directory and is not a URL.

Each Arch PKGBUILD owns its source files per convention; the
duplication risk is low because r1..r4 are closed-release patches.
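
The failure shape can be reproduced in a scratch directory. The sketch below is illustrative only: the directory and patch names are made up, and it assumes GNU cp, where `-r` copies symlinks as links. The `-L` variant shown is one alternative way to avoid dangling links; it is not what this commit did (the commit ships real files instead).

```shell
# Hypothetical reproduction; none of these paths exist in the repo.
demo_symlink_copy() {
  work=$(mktemp -d)
  mkdir -p "$work/arch/pkg-a" "$work/arch/pkg-b"
  echo "patch body" > "$work/arch/pkg-a/0001.patch"
  # Relative link into the sibling package directory.
  ln -s ../pkg-a/0001.patch "$work/arch/pkg-b/0001.patch"

  # GNU cp -r keeps the link as a link; its relative target is gone.
  cp -r "$work/arch/pkg-b" "$work/build"
  [ -e "$work/build/0001.patch" ] && echo "plain copy: resolves" || echo "plain copy: dangling"

  # -L dereferences links while copying, so a regular file lands in the copy.
  cp -rL "$work/arch/pkg-b" "$work/build-deref"
  [ -e "$work/build-deref/0001.patch" ] && echo "cp -rL: resolves" || echo "cp -rL: dangling"

  rm -rf "$work"
}
demo_symlink_copy
```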

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 09:49:59 +02:00
marfrit 6288536223 Merge pull request 'ci: fix duplicate run: key in build.yml wipe-secrets step (unblocks all builds since 2026-05-21)' (#82) from claude-noether/marfrit-packages:noether/fix-build-yaml-duplicate-run into main
Reviewed-on: marfrit/marfrit-packages#82
2026-05-22 07:30:37 +00:00
claude-noether 09d8813507 ci: fix duplicate run: key in build.yml wipe-secrets step
PR #79 (6ee8f2748, mesa-panvk-bifrost-video) added a second `run:`
mapping key on the next line of the same step:

    - name: wipe secrets
      if: always()
      run: rm -f /root/repo_pass /root/.ssh/id_ed25519
      run: rm -f /root/.ssh/id_ed25519_hertz    ← duplicate `run:` key

YAML doesn't allow the same key to appear twice in one mapping, so
Gitea's workflow parser rejected the entire file:

  actions/workflows.go:124:DetectWorkflows() [W]
    ignore invalid workflow "build.yml": yaml: unmarshal errors:
      line 1423: mapping key "run" already defined at line 1422

Result: every push to main since 6ee8f2748 (2026-05-21 23:14 CEST)
silently failed to enqueue ANY action run.  PR #80's "re-trigger by
README touch" had no chance — workflow file was invalid before #80
even existed.  Runs #161-163 do not exist; #160 (pre-#79) is the
last successful enqueue.

Fix: merge the two single-line `run:` invocations into one literal
block.  Functionally identical, YAML-valid.
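
A rough pre-merge guard against this class of mistake could look like the sketch below. It is a line-based heuristic, not a YAML parser, and the function name `find_dup_run_keys` is hypothetical. It only flags a `run:` key that repeats before the next list item starts.

```shell
# Hypothetical helper: report a repeated `run:` key inside one workflow step.
# Heuristic only; text inside a `run: |` block that itself starts with "run:"
# would be misreported.
find_dup_run_keys() {
  awk '
    /^[ \t]*- / { seen = 0 }                 # a new list item starts a new step
    /^[ \t]*(- )?run:/ {
      if (seen) { printf "line %d: duplicate run: key\n", NR; bad = 1 }
      seen = 1
    }
    END { exit bad }                         # non-zero exit when a duplicate was seen
  ' "$1"
}
```

Called as `find_dup_run_keys .gitea/workflows/build.yml`, it exits non-zero when it sees a repeated key, so it could gate a push. A real YAML linter would be the more robust choice.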

Post-merge: workflow file becomes valid again, new push to main
triggers a fresh build run covering the backlog (#79's
mesa-panvk-bifrost-video build that #80 wanted re-triggered).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 09:15:18 +02:00
marfrit 8a3186b53c Merge pull request 'mesa-panvk-bifrost-video: re-trigger Actions for PR #79' (#80) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-retrigger into main
Reviewed-on: marfrit/marfrit-packages#80
2026-05-22 06:32:42 +00:00
marfrit b81e2251c2 mesa-panvk-bifrost-video: re-trigger Actions for PR #79
The merge commit for PR #79 (e7cc22e42) did not auto-fire the
Gitea Actions workflow despite touching paths matched by the
build.yml filter (arch/** + .gitea/workflows/**). No run row
exists between #160 (PR #78 merge) and now. This README touch
is a no-op content change to force a fresh workflow run
through the standard push trigger.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 07:51:01 +02:00
marfrit e7cc22e42d Merge pull request 'mesa-panvk-bifrost-video: sibling package adding VK_KHR_video_decode_h264' (#79) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-video-phase8 into main
Reviewed-on: marfrit/marfrit-packages#79
2026-05-21 21:33:53 +00:00
marfrit 62b6b0a700 Merge pull request 'ffmpeg-v4l2-request-fourier (debian): drop --enable-libxml2 (runner SONAME skew)' (#78) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-drop-libxml2 into main
Reviewed-on: marfrit/marfrit-packages#78
2026-05-21 21:24:07 +00:00
marfrit a8f4a70887 ffmpeg-v4l2-request-fourier (debian): drop --enable-libxml2 (runner SONAME skew)
The Gitea debian-aarch64 runner has been upgraded past Debian trixie
and now ships libxml2 ≥ 2.14 (SONAME 16) while higgs (and any other
trixie target) still has libxml2 2.12 (SONAME 2).  PKGREL -5 built cleanly,
but on higgs the daedalus-v4l2 daemon's dlopen of libavformat.so.62
fails:

    dlopen(libavformat.so.62): libxml2.so.16:
    cannot open shared object file: No such file or directory

Drop --enable-libxml2 from the Debian configure invocation; remove
the libxml2 entry from Depends; remove libxml2-dev from the CI
build-deps.  FFmpeg's libxml2-backed DASH demuxer is unused on the
Fourier fleet — daedalus-v4l2 daemon feeds AVPackets straight to
avcodec_send_packet (no demux); mpv-fourier uses yt-dlp + mpv's own
stream code; firefox-fourier uses gecko-media's DASH demux.

Bumps PKGREL 5 → 6.  No source code or substitution-patch change.
Mirrors the libva trixie/runner ABI-skew workaround pattern
(marfrit-packages PR #62).

Arch PKGBUILD unaffected — Arch runner + Arch consumers both
rolling, libxml2 SONAMEs match.

After this lands, re-deploy on higgs via:
    sudo apt update && sudo apt install -y ffmpeg-v4l2-request-fourier
    sudo systemctl restart daedalus-v4l2
2026-05-21 23:18:00 +02:00
marfrit 6ee8f2748e mesa-panvk-bifrost-video: sibling package adding VK_KHR_video_decode_h264
panvk-bifrost-video campaign close. Phase 4 byte-exact validated
2026-05-21 on RK3566/PineTab2 (Mali-G52 r1 MC1 + hantro VPU): 48/48
unique BBB display frames decoded by this driver are byte-identical
to ffmpeg+libva-v4l2-request-fourier on the same hantro hardware
(frame 42 Y md5 = 54b9b396e6cd377256eb4bce0efc0bed both ways).
Phase 5 second-model review passed; load-bearing findings applied.

Co-installs at /usr/lib/panvk-bifrost-video/ parallel to the r4
sibling at /usr/lib/panvk-bifrost/; opt-in via VK_ICD_FILENAMES.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 23:14:01 +02:00
marfrit 711a921e66 Merge pull request 'ffmpeg-v4l2-request-fourier: PKGREL 3 → 5 (force rebuild past orphan -4 .deb)' (#77) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-debian-pkgrel-5 into main
Reviewed-on: marfrit/marfrit-packages#77
2026-05-21 20:36:15 +00:00
marfrit 9bf97fdb49 ffmpeg-v4l2-request-fourier: PKGREL 3 → 5 (force rebuild past orphan -4 .deb)
PR #76 (H.264 IDCT 4×4 daedalus-fourier substitution) was merged but
the resulting .deb was not actually built: an orphan
ffmpeg-v4l2-request-fourier_8.1+rfourier+gb57fbbe-4_arm64.deb (dated
2026-05-19, no matching source commit in main) sat in the apt pool.
.gitea/scripts/check-already-published.sh's debian branch compares
`dpkg --compare-versions $pool_ver ge $source_full` — pool -4
≥ source -3, so CI's skip-check emitted skip=1 and short-circuited
the build.  The ffmpeg-v4l2-request-debian Action reported success
without actually publishing.

Bump source PKGREL past -4 so the next CI run sees source > pool
and proceeds to build + publish.
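
The skip decision described above can be sketched as follows. This is a hypothetical re-creation, not the contents of check-already-published.sh: the function name and the `skip=0` output are assumptions. The `dpkg --compare-versions` call and the version strings are taken from this message.

```shell
# Hypothetical re-creation of the debian-branch skip check.
should_skip() {
  pool_ver=$1
  source_full=$2
  # Skip when the pool already holds a version >= the source version.
  if dpkg --compare-versions "$pool_ver" ge "$source_full"; then
    echo "skip=1"
  else
    echo "skip=0"   # assumed spelling of the "do build" result
  fi
}

should_skip "8.1+rfourier+gb57fbbe-4" "8.1+rfourier+gb57fbbe-3"   # orphan pool -4 vs source -3
should_skip "8.1+rfourier+gb57fbbe-4" "8.1+rfourier+gb57fbbe-5"   # after the bump to -5
```

With dpkg available, the first call prints skip=1 (the build is short-circuited) and the second prints skip=0, which is why the bump jumps straight to -5.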

No source code change beyond PKGREL + changelog.  Arch side
unaffected (its skip-check is exact-URL-match, not pool-head-ge).
2026-05-21 22:17:00 +02:00
marfrit a536e20218 Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 4×4 → daedalus-fourier' (#76) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-idct4-daedalus into main
Reviewed-on: marfrit/marfrit-packages#76
2026-05-21 19:57:31 +00:00
marfrit a1dba5f630 Merge remote-tracking branch 'origin/main' into noether/ffmpeg-fourier-idct4-daedalus 2026-05-21 21:56:41 +02:00
marfrit 88a65cb6d0 CI: add cmake / ninja-build / libvulkan-dev / glslang-tools to ffmpeg-debian deps
The ffmpeg-v4l2-request-debian job now needs to build daedalus-fourier
into a temp prefix before configuring FFmpeg (substitution patch
0003-h264-idct4-daedalus-fourier.patch links libdaedalus_core.a into
libavcodec.so).  Mirror the build-deps the daedalus-v4l2-debian job
already declared for the same reason.

No-op on Arch — makepkg --syncdeps auto-installs cmake/ninja/
vulkan-headers from the PKGBUILD makedepends.
2026-05-21 21:47:48 +02:00
marfrit e641d679d3 ffmpeg-v4l2-request-fourier: substitute H.264 IDCT 4×4 → daedalus-fourier
First cycle of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11
step 2).  H264DSPContext.idct_add — called per 4×4 block from the
intra-4×4 decode path in libavcodec/h264_mb.c — now dispatches through
daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.

## What

- Add 0003-h264-idct4-daedalus-fourier.patch (in both arch/ and
  debian/ ffmpeg-v4l2-request-fourier/).  Creates
  libavcodec/aarch64/h264_idct_daedalus.c (ff_h264_idct_add_daedalus
  shim + lazy pthread_once context init via
  daedalus_ctx_create_no_qpu), patches
  libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->idct_add to
  the shim, adds the new .o to libavcodec/aarch64/Makefile.
- arch/PKGBUILD + debian/build-deb.sh: fetch + build
  daedalus-fourier (pinned at d87239d — lockstep with the
  daedalus-v4l2 daemon's inline build) with
  -DCMAKE_POSITION_INDEPENDENT_CODE=ON into a per-build temp prefix,
  then pass --extra-cflags=-I.../include --extra-ldflags=-L.../lib
  --extra-libs="-ldaedalus_core -lvulkan -lpthread" to FFmpeg
  configure.  daedalus_core.a is static-linked into libavcodec.so.62.
- debian/control Depends gains libvulkan1 (daedalus_core PUBLIC-links
  Vulkan::Vulkan for the queryable QPU substrate; the no-QPU
  constructor still works at runtime but the loader needs
  libvulkan.so.1 present to dlopen libavcodec.so.62).
- arch/PKGBUILD depends gains vulkan-icd-loader, makedepends gains
  cmake / ninja / vulkan-headers.

## Why

The recipe layer picks the substrate; for cycle 6 (H.264 IDCT 4×4)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution with one extra dispatch call and recipe-table lookup.
The point of this first cycle isn't perf wins — it's plumbing.  Once
the path is wired and stable, follow-up patches batch through the
bulk paths (idct_add16 / idct_add16intra / idct_add8) and stack
cycles 7/8/9 (IDCT 8×8, luma-v deblock, qpel mc20).

Bit-exact against ff_h264_idct_add_neon (daedalus-fourier cycle 6
green; FFmpeg's 4×4 block storage matches daedalus's column-major
convention).

## Scope NOT covered

- Bulk paths (idct_add16 / idct_add16intra / idct_add8) — most IDCT
  4×4 calls in real H.264 streams go through these, not the per-
  block c->idct_add path; intra-4×4-only macroblocks are a minority.
  Batched substitution lands in a follow-up.
- High-bit-depth (10-bit) path — not touched; 8-bit only.
- Cycles 7/8/9 — separate PRs.

## SONAME

Unchanged.  libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.
No daedalus-v4l2-dkms or daedalus-v4l2 bump required.

## Refs

- reauktion/daedalus-v4l2 issue #11 (substitution arc): reauktion/daedalus-v4l2#11
- marfrit/daedalus-fourier cycle 6 close (H.264 IDCT 4×4 NEON green)
2026-05-21 21:44:35 +02:00
marfrit 877238bd1b Merge pull request 'daedalus-v4l2: 3bc0da1 -> 6e6dfa1 — dlopen Kwiboo soname 62 + CI build-deps swap' (#75) from claude-noether/marfrit-packages:noether/daedalus-bump-6e6dfa1-soname62 into main
Reviewed-on: marfrit/marfrit-packages#75
2026-05-21 19:26:39 +00:00
claude-noether 27617e4cb0 daedalus-v4l2: 3bc0da1 -> 6e6dfa1 — dlopen Kwiboo soname 62 (#16) 2026-05-21 21:24:03 +02:00
marfrit a2daab1b28 Merge pull request 'daedalus-v4l2: 77e14e5 -> 3bc0da1 — decode_us + periodic stats' (#74) from claude-noether/marfrit-packages:noether/daedalus-bump-3bc0da1 into main
Reviewed-on: marfrit/marfrit-packages#74
2026-05-21 18:50:46 +00:00
claude-noether 9146e83710 daedalus-v4l2: 77e14e5 -> 3bc0da1 — decode_us + periodic stats (#15) 2026-05-21 20:29:07 +02:00
marfrit abf8fb3077 Merge pull request 'ci: add libvulkan-dev + glslang-tools for daedalus-fourier build dep' (#73) from claude-noether/marfrit-packages:noether/ci-fourier-build-deps into main
Reviewed-on: marfrit/marfrit-packages#73
2026-05-21 18:05:59 +00:00
claude-noether 1414dfeac2 .gitea/workflows: add libvulkan-dev + glslang-tools to daedalus-v4l2 Debian build deps
The daedalus-v4l2 build-deb.sh (post marfrit-packages#72) now fetches
+ cmake-builds daedalus-fourier into a per-build temp prefix before
building the daemon, so the static-archive can be linked in.
daedalus-fourier's CMakeLists requires Vulkan headers and glslangValidator
(for SPIR-V compilation of the .comp compute shaders).  Without them
the configure step on the debian-aarch64 runner fails with:

  CMake Error at FindPackageHandleStandardArgs.cmake:233 (message):
    Could NOT find Vulkan (missing: Vulkan_LIBRARY Vulkan_INCLUDE_DIR)

(Observed on Gitea Actions run 1056.)

Add `libvulkan-dev` and `glslang-tools` to the apt-get install line so
the in-build daedalus-fourier compile succeeds and the daemon can link.
2026-05-21 19:58:19 +02:00
marfrit 41c1e0b6b9 Merge pull request 'daedalus-v4l2: 5d8b436 -> 77e14e5 — #12 (LOW_DELAY) + #13 (daedalus-fourier linkage)' (#72) from claude-noether/marfrit-packages:noether/daedalus-bump-77e14e5-with-fourier into main
Reviewed-on: marfrit/marfrit-packages#72
2026-05-21 17:15:12 +00:00
claude-noether c9a4b82f2c daedalus-v4l2: 5d8b436 -> 77e14e5 — picks up #12 (LOW_DELAY) + #13 (daedalus-fourier linkage)
Daemon-only bump (no daedalus-v4l2-dkms change needed; PROTO_VERSION
stays at 0).

#12 (LOW_DELAY half-measure): daemon sets AV_CODEC_FLAG_LOW_DELAY on
the H.264 AVCodecContext so libavcodec emits frames in decode order
~99% of the time (a few stragglers at GOP boundaries when the
stream's SPS num_reorder_frames overrides the flag).  Visible
improvement vs the 2-1-4-3 pair-swap on Firefox + mpv playback;
not the permanent fix — see daedalus-v4l2#11 for the architectural
plan to substitute daedalus-fourier kernels for libavcodec's
pixel math one cycle at a time.

#13 (daedalus-fourier linkage): daemon now pkg-config-links against
the daedalus-fourier kernel library (marfrit/daedalus-fourier) and
logs substrate availability at startup.  No kernels dispatched yet
— this is the build-time foundation for the substitution work.

build-deb.sh updated to fetch + build + install daedalus-fourier
(pinned at d87239d, marfrit/daedalus-fourier PR #1) into a per-
build temp prefix before invoking the daemon's cmake, exposing it
via PKG_CONFIG_PATH.  Static-linked, so the resulting .deb has no
new runtime deps.  Requires libvulkan-dev + glslang-tools on the
CI runner.

Arch PKGBUILD bumped to the same upstream commit but Arch packaging
for daedalus-fourier itself is a follow-up; until that lands the
Arch build expects daedalus-fourier installed by the user (AUR-style).
Debian-side is end-to-end self-contained via build-deb.sh.

Refs:
  * reauktion/daedalus-v4l2#12
  * reauktion/daedalus-v4l2#13
  * reauktion/daedalus-v4l2#11
  * marfrit/daedalus-fourier#1
2026-05-21 18:39:22 +02:00
marfrit 736b6da176 Merge pull request 'daedalus-v4l2{,-dkms}: 79256dc/6ffe92b -> 5d8b436 — revert parking design' (#71) from claude-noether/marfrit-packages:noether/daedalus-revert-bump-5d8b436 into main
Reviewed-on: marfrit/marfrit-packages#71
2026-05-21 14:54:18 +00:00
claude-noether 34972ae9c1 daedalus-v4l2{,-dkms}: 79256dc/6ffe92b -> 5d8b436 — revert parking design
Lock-step downgrade of both packages to the revert tip of
daedalus-v4l2 (PR #10 closed PRs #7 + #8).  After
0.1.0+r28+g79256dc-1 / 0.1.0+r30+g6ffe92b-1 landed in production,
mpv (--hwdec=vaapi-copy) failed before playback with "Unable to dequeue
buffer: Resource temporarily unavailable" because the daemon
parked CAPTURE buffers waiting for libavcodec's display-order
reorder, violating libva's V4L2 stateless 1:1 contract.  See
daedalus-v4l2#9 for the diagnostic, #10 for the revert PR.

DAEDALUS_PROTO_VERSION drops 1 → 0; install both .debs in the same
apt transaction.  Userspace ABI returns to the f0d4186-equivalent
behaviour, plus PR #4 (cosmetic H.264 menu controls).  The
daedalus-v4l2-dkms #64 multi-kernel postinst behaviour stays in
build-deb.sh.

Visible regression: H.264 B-frame streams in Firefox return to the
"2 1 4 3 6 5" pair-swap visual.  Proper fix (concurrent in-flight
requests in daemon + display-order reorder moved into libva-v4l2-
request-fourier) tracked at daedalus-v4l2#11.

Refs:
  * reauktion/daedalus-v4l2#9
  * reauktion/daedalus-v4l2#10  (merged)
  * reauktion/daedalus-v4l2#11
2026-05-21 15:42:03 +02:00
marfrit a9f1b833b9 Merge pull request 'mesa-panvk-bifrost: r3 -> r4 — iter17 XFB primitive decomposition' (#70) from claude-noether/marfrit-packages:noether/mesa-panvk-bifrost-r4-iter17-xfb-decomp into main
Reviewed-on: marfrit/marfrit-packages#70
2026-05-21 12:18:23 +00:00
marfrit 83e8eca56d mesa-panvk-bifrost: r3 -> r4 — iter17 XFB primitive decomposition
iter17 closes the 162 winding_* CTS failures from iter15's baseline by
replacing the upstream pan_nir_lower_xfb call with a panvk-specific NIR
pass (panvk_per_arch(nir_lower_xfb)) that handles per-primitive
decomposition for non-LIST topologies (LINE_STRIP, TRIANGLE_STRIP,
TRIANGLE_FAN, and the four _WITH_ADJACENCY variants).

Topology + per-instance output vertex count are threaded as new sysvals
(vs.xfb_topology + vs.xfb_output_count) so the NIR pass can dispatch
per-topology at runtime without compiling 7+ shader variants.

dEQP-VK.transform_feedback.simple.* result (133596 cases total):
                  iter15 baseline  ->  iter17
  Pass:             796               958   (+162)
  Fail:             243               81    (-162; resume_* by-design only)
  NotSupported:     132551            132551
  Fatal-skip:       6                 6
  Pass rate of runnable: 76.2% -> 91.7% (+15.5pp)

100% of the iter15 winding-fail cluster closed. The remaining 81 fails
are all resume_* (pause/resume XFB, by design — we advertise
transformFeedbackDraw=false).
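
The quoted pass rates can be re-derived from the table. The sketch below assumes "runnable" means Pass + Fail + Fatal-skip. That denominator is inferred because it reproduces the 76.2% and 91.7% figures; it is not stated in this message. The helper name `pass_rate` is made up.

```shell
# pass_rate PASS FAIL FATAL_SKIP -> percent of runnable cases, one decimal.
# Assumption: runnable = pass + fail + fatal-skip (NotSupported excluded).
pass_rate() {
  awk -v p="$1" -v f="$2" -v s="$3" \
    'BEGIN { printf "%.1f\n", 100 * p / (p + f + s) }'
}

pass_rate 796 243 6   # iter15 baseline
pass_rate 958 81 6    # iter17
```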

Second-model review (janet) produced 3 findings; Findings 1+2 were
already fixed in the in-tree applied state (stale applied_state/ snapshot
read by reviewer), Finding 3 (degenerate N underflow on N<2) addressed
by gating non-LIST emission on `output_count > 0` predicate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 14:07:00 +02:00
marfrit 1c8c186681 Merge pull request 'daedalus-v4l2-dkms: 79256dc -> 6ffe92b — fix kernel panic regression from #67' (#69) from claude-noether/marfrit-packages:noether/daedalus-dkms-bump-6ffe92b into main
Reviewed-on: marfrit/marfrit-packages#69
2026-05-21 12:00:15 +00:00
claude-noether a0be2dcc9f daedalus-v4l2-dkms: 79256dc -> 6ffe92b — fix kernel panic from #7
Kernel-only bump.  Fixes the hard-reboot regression introduced by
the daedalus-v4l2#7 split-completion design and observed on higgs
(Pi CM5) during the first mpv vaapi-copy playback of 720p H.264:
device_run now removes src + dst from m2m_ctx's rdy_queue at the
moment it picks them up, not at buf_done time.  Without this, a
parked dst_buf (waiting for libavcodec's display-order release)
stayed in the rdy_queue and got re-picked by the next device_run
after SRC_CONSUMED's job_finish released the scheduler — two
inflight entries on the same vb2_buffer, later HAS_PIXELS calls
list_del on an already-detached list_head, panic.

DAEDALUS_PROTO_VERSION stays at 1 — daemon (userspace
daedalus-v4l2) need NOT bump in lockstep with this DKMS update.
The existing daedalus-v4l2 0.1.0+r28+g79256dc is wire-compatible
with daedalus-v4l2-dkms 0.1.0+r30+g6ffe92b.

Refs:
  * reauktion/daedalus-v4l2#8
2026-05-21 13:56:42 +02:00
marfrit eb89f12c3e Merge pull request 'libva-v4l2-request-fourier: bump pin to c454618 (#15 transparent resize)' (#68) from claude-noether/marfrit-packages:bump-libva-fourier-c454618-issue-15 into main
Reviewed-on: marfrit/marfrit-packages#68
2026-05-21 11:25:39 +00:00
55 changed files with 8720 additions and 58 deletions
+337 -10
@@ -930,12 +930,13 @@ jobs:
# map 1:1 to the previous Arch list; libav*-dev intentionally
# absent (we are FFmpeg itself, providing those libs).
retry apt-get install -y --no-install-recommends \
build-essential git pkg-config nasm yasm \
build-essential cmake ninja-build git pkg-config nasm yasm \
linux-libc-dev libgl1-mesa-dev libasound2-dev libbz2-dev \
libfontconfig-dev libfribidi-dev libgmp-dev libgnutls28-dev \
libmp3lame-dev libass-dev libdav1d-dev libdrm-dev \
libfreetype-dev libpulse-dev libva-dev libvorbis-dev libvpx-dev \
libwebp-dev libx264-dev libx265-dev libxml2-dev libopus-dev \
libwebp-dev libx264-dev libx265-dev libopus-dev \
libvulkan-dev glslang-tools \
v4l-utils liblzma-dev zlib1g-dev \
curl ca-certificates openssh-client rsync dpkg-dev
@@ -1159,16 +1160,30 @@ jobs:
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
export DEBIAN_FRONTEND=noninteractive
retry apt-get update -qq
# libav*-dev provide the headers daedalus daemon dlopens at
# runtime — Debian's stock packages match the trixie ABI the
# daemon will encounter on Pi 5 hosts (both ship libavcodec
# 61.x). The fourier ffmpeg fork isn't needed here; the
# daemon never link-binds against libav (Option γ — dlopen
# at runtime), so any header set with the right struct
# definitions works.
# FFmpeg headers + sonames the daemon dlopens. As of
# daedalus-v4l2 PR #16 (commit 514da29), the daemon targets
# the Kwiboo fork's libavcodec.so.62 / libavformat.so.62 /
# libavutil.so.60 at /opt/fourier — so the build needs
# /opt/fourier/include and /opt/fourier/lib/pkgconfig.
# ffmpeg-v4l2-request-fourier provides both (plus the
# runtime libs the .deb will dlopen on the target host;
# we install it as a build-dep here and the dpkg-shlibdeps
# step pulls it into the daemon .deb's Depends automatically).
# Debian-stock libav*-dev removed — would conflict on
# /usr/include/libavcodec/avcodec.h vs /opt/fourier's copy.
#
# libvulkan-dev + glslang-tools: needed by the in-build
# daedalus-fourier fetch (build-deb.sh fetches the sibling
# library, cmake-builds it into a temp prefix, then the
# daedalus daemon static-links against it via pkg-config).
# Without these, daedalus-fourier's find_package(Vulkan)
# and glslangValidator find_program both fail at configure
# time. See marfrit/daedalus-fourier PR #1 +
# reauktion/daedalus-v4l2 PR #13.
retry apt-get install -y --no-install-recommends \
build-essential cmake ninja-build pkg-config git \
libavcodec-dev libavformat-dev libavutil-dev libdrm-dev \
ffmpeg-v4l2-request-fourier libdrm-dev \
libvulkan-dev glslang-tools \
linux-libc-dev \
curl ca-certificates openssh-client rsync dpkg-dev
@@ -1402,6 +1417,318 @@ jobs:
-e 'ssh -i /root/.ssh/id_ed25519' \
./ mfritsche@nc.reauktion.de:arch/aarch64/
- name: wipe secrets
if: always()
run: |
rm -f /root/repo_pass /root/.ssh/id_ed25519
rm -f /root/.ssh/id_ed25519_hertz
# -------------------------------------------------------------------------
# mesa-panvk-bifrost-video (aarch64 only) — sibling adding VK_KHR_video_decode_h264
# via the V4L2 hantro VPU. Phase 4 byte-exact validated 2026-05-21.
# Co-installs at /usr/lib/panvk-bifrost-video/ (parallel to r4); opt-in
# via VK_ICD_FILENAMES (no launcher shipped — uses standard Vulkan loader).
#
# Build is slow (~30-60min on actrunner-aarch64): full Mesa-from-source.
# Standalone job — no `needs:` since it doesn't depend on the fourier
# codec stack. continue-on-error so a build hiccup doesn't block other
# jobs in the same workflow run.
# -------------------------------------------------------------------------
mesa-panvk-bifrost-video-aarch64:
runs-on: arch-aarch64
continue-on-error: true
steps:
- uses: actions/checkout@v4
- name: skip if already published
id: skip-check
run: |
set -e
result=$(./.gitea/scripts/check-already-published.sh arch/mesa-panvk-bifrost-video)
echo "$result" >> "$GITHUB_OUTPUT"
echo "decision: $result"
- name: bootstrap runner (idempotent)
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
retry pacman -Syu --noconfirm --needed base-devel git rsync gnupg openssh sudo
- name: import signing key
if: steps.skip-check.outputs.skip != '1'
env:
PRIV: ${{ secrets.MARFRIT_REPO_PRIVATE_KEY }}
PASS: ${{ secrets.MARFRIT_REPO_PASSPHRASE }}
run: |
set -e
gpgconf --homedir /root/.gnupg --kill all 2>/dev/null || true
rm -rf /root/.gnupg /root/repo_pass
mkdir -m700 -p /root/.gnupg
printf '%s' "$PASS" > /root/repo_pass
chmod 600 /root/repo_pass
printf '%s\n' "$PRIV" | gpg --batch --import
echo "92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C:6:" | gpg --import-ownertrust
- name: install deploy ssh key
if: steps.skip-check.outputs.skip != '1'
env:
KEY: ${{ secrets.MARFRIT_REPO_DEPLOY_KEY }}
run: |
mkdir -m700 -p /root/.ssh
printf '%s\n' "$KEY" > /root/.ssh/id_ed25519
chmod 600 /root/.ssh/id_ed25519
ssh-keyscan -t ed25519 nc.reauktion.de > /root/.ssh/known_hosts 2>/dev/null
- name: makepkg mesa-panvk-bifrost-video
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
rm -rf /tmp/build-mesa-panvk-bifrost-video
cp -r arch/mesa-panvk-bifrost-video /tmp/build-mesa-panvk-bifrost-video
chown -R builder:builder /tmp/build-mesa-panvk-bifrost-video
cd /tmp/build-mesa-panvk-bifrost-video
# MAKEFLAGS for parallel build; runner is multi-core.
# --skipinteg because sha256sums=SKIP in PKGBUILD (matches the
# fourier-fork PKGBUILD convention).
sudo -u builder -H env MAKEFLAGS="-j60" \
makepkg --nocheck --noconfirm --syncdeps --cleanbuild --skipinteg
ls -la *.pkg.tar.* | grep -v "\.sig$"
- name: sign mesa-panvk-bifrost-video
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
cd /tmp/build-mesa-panvk-bifrost-video
for f in *.pkg.tar.xz *.pkg.tar.zst *.pkg.tar.gz; do
[ -f "$f" ] || continue
gpg --batch --pinentry-mode loopback --passphrase-file /root/repo_pass \
--detach-sign --yes -u 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C "$f"
done
- name: update aarch64 repo db
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
mkdir -p /tmp/arch-stage-mesa-panvk-video
cd /tmp/arch-stage-mesa-panvk-video
rm -f *
for f in marfrit.db.tar.gz marfrit.db.tar.gz.sig marfrit.files.tar.gz marfrit.files.tar.gz.sig; do
curl -sSLf "https://packages.reauktion.de/arch/aarch64/$f" -o "$f" || rm -f "$f"
done
for ext in xz zst gz; do
ls /tmp/build-mesa-panvk-bifrost-video/*.pkg.tar.$ext 2>/dev/null && \
mv /tmp/build-mesa-panvk-bifrost-video/*.pkg.tar.$ext /tmp/build-mesa-panvk-bifrost-video/*.pkg.tar.$ext.sig .
done || true
export GNUPGHOME=/root/.gnupg
printf 'pinentry-mode loopback\npassphrase-file /root/repo_pass\n' > /root/.gnupg/gpg.conf
printf 'allow-loopback-pinentry\n' > /root/.gnupg/gpg-agent.conf
gpg-connect-agent reloadagent /bye
pkgs=()
for ext in xz zst gz; do
for f in *.pkg.tar.$ext; do [ -f "$f" ] && pkgs+=("$f"); done
done
if [ -f marfrit.db.tar.gz ]; then
for f in "${pkgs[@]}"; do
name=$(echo "$f" | sed -E 's/-[0-9].*//')
repo-remove --sign --key 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C \
marfrit.db.tar.gz "$name" 2>/dev/null || true
done
fi
repo-add --new --sign --key 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C \
--verify marfrit.db.tar.gz "${pkgs[@]}"
ln -sf marfrit.db.tar.gz marfrit.db
ln -sf marfrit.files.tar.gz marfrit.files
ln -sf marfrit.db.tar.gz.sig marfrit.db.sig
rm -f marfrit.files.sig
- name: publish to aarch64
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
cd /tmp/arch-stage-mesa-panvk-video
retry rsync -avL --copy-unsafe-links \
-e 'ssh -i /root/.ssh/id_ed25519' \
./ mfritsche@nc.reauktion.de:arch/aarch64/
- name: wipe secrets
if: always()
run: rm -f /root/repo_pass /root/.ssh/id_ed25519
# -------------------------------------------------------------------------
# aish (arch=any) — pure LuaJIT, one .pkg.tar valid on every pacman target.
# Same dual-arch publish pattern as lmcp / claude-his.
# -------------------------------------------------------------------------
aish-any:
needs: lmcp-debian # parallel with claude-his-any (pure-Lua sibling),
# serialized via the shared arch-aarch64 runner.
# Avoids needless wait through the fourier stack.
runs-on: arch-aarch64
steps:
- uses: actions/checkout@v4
- name: skip if already published
id: skip-check
run: |
set -e
result=$(./.gitea/scripts/check-already-published.sh arch/aish)
echo "$result" >> "$GITHUB_OUTPUT"
echo "decision: $result"
- name: bootstrap runner (idempotent)
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
retry pacman -Syu --noconfirm --needed base-devel git rsync gnupg openssh sudo luajit readline curl
- name: import signing key
if: steps.skip-check.outputs.skip != '1'
env:
PRIV: ${{ secrets.MARFRIT_REPO_PRIVATE_KEY }}
PASS: ${{ secrets.MARFRIT_REPO_PASSPHRASE }}
run: |
set -e
gpgconf --homedir /root/.gnupg --kill all 2>/dev/null || true
rm -rf /root/.gnupg /root/repo_pass
mkdir -m700 -p /root/.gnupg
printf '%s' "$PASS" > /root/repo_pass
chmod 600 /root/repo_pass
printf '%s\n' "$PRIV" | gpg --batch --import
echo "92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C:6:" | gpg --import-ownertrust
- name: install deploy ssh key
if: steps.skip-check.outputs.skip != '1'
env:
KEY: ${{ secrets.MARFRIT_REPO_DEPLOY_KEY }}
run: |
mkdir -m700 -p /root/.ssh
printf '%s\n' "$KEY" > /root/.ssh/id_ed25519
chmod 600 /root/.ssh/id_ed25519
ssh-keyscan -t ed25519 nc.reauktion.de > /root/.ssh/known_hosts 2>/dev/null
- name: makepkg aish
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
rm -rf /tmp/build-aish
cp -r arch/aish /tmp/build-aish
chown -R builder:builder /tmp/build-aish
cd /tmp/build-aish
sudo -u builder -H makepkg --nocheck --noconfirm --syncdeps --cleanbuild
ls -la *.pkg.tar.* | grep -v "\.sig$"
- name: sign aish
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
cd /tmp/build-aish
for f in *.pkg.tar.xz *.pkg.tar.zst *.pkg.tar.gz; do
[ -f "$f" ] || continue
gpg --batch --pinentry-mode loopback --passphrase-file /root/repo_pass \
--detach-sign --yes -u 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C "$f"
done
- name: publish aish to both arches
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
export GNUPGHOME=/root/.gnupg
printf 'pinentry-mode loopback\npassphrase-file /root/repo_pass\n' > /root/.gnupg/gpg.conf
printf 'allow-loopback-pinentry\n' > /root/.gnupg/gpg-agent.conf
gpg-connect-agent reloadagent /bye
for target in aarch64 x86_64; do
stage="/tmp/arch-stage-$target"
rm -rf "$stage"; mkdir -p "$stage"; cd "$stage"
for f in marfrit.db.tar.gz marfrit.db.tar.gz.sig marfrit.files.tar.gz marfrit.files.tar.gz.sig; do
curl -sSLf "https://packages.reauktion.de/arch/$target/$f" -o "$f" || rm -f "$f"
done
cp /tmp/build-aish/*.pkg.tar.* .
pkgs=()
for ext in xz zst gz; do
for f in *.pkg.tar.$ext; do [ -f "$f" ] && pkgs+=("$f"); done
done
if [ -f marfrit.db.tar.gz ]; then
for f in "${pkgs[@]}"; do
name=$(echo "$f" | sed -E 's/-[0-9].*//')
repo-remove --sign --key 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C \
marfrit.db.tar.gz "$name" 2>/dev/null || true
done
fi
repo-add --new --sign --key 92D5E96D8F63C75E4116AA1FF5C8C4603D0D250C \
--verify marfrit.db.tar.gz "${pkgs[@]}"
ln -sf marfrit.db.tar.gz marfrit.db
ln -sf marfrit.files.tar.gz marfrit.files
ln -sf marfrit.db.tar.gz.sig marfrit.db.sig
ln -sf marfrit.files.tar.gz.sig marfrit.files.sig
retry rsync -avL --copy-unsafe-links \
-e 'ssh -i /root/.ssh/id_ed25519' \
./ "mfritsche@nc.reauktion.de:arch/$target/"
done
- name: wipe secrets
if: always()
run: rm -f /root/repo_pass /root/.ssh/id_ed25519
aish-debian:
needs: aish-any # serialize after the Arch build to share the runner
runs-on: arch-aarch64
steps:
- uses: actions/checkout@v4
- name: skip if already published
id: skip-check
run: |
set -e
result=$(./.gitea/scripts/check-already-published.sh debian/aish)
echo "$result" >> "$GITHUB_OUTPUT"
echo "decision: $result"
- name: install dpkg
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
retry pacman -Syu --noconfirm --needed dpkg openssh rsync curl
- name: install hertz deploy ssh key
if: steps.skip-check.outputs.skip != '1'
env:
KEY: ${{ secrets.MARFRIT_REPO_HERTZ_KEY }}
run: |
mkdir -m700 -p /root/.ssh
printf '%s\n' "$KEY" > /root/.ssh/id_ed25519_hertz
chmod 600 /root/.ssh/id_ed25519_hertz
ssh-keyscan -t ed25519 hertz.fritz.box >> /root/.ssh/known_hosts 2>/dev/null
- name: build aish .deb
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
cd debian/aish
./build-deb.sh
ls -la *.deb
- name: upload + publish to suites
if: steps.skip-check.outputs.skip != '1'
run: |
set -e
retry() { for i in 1 2 3; do "$@" && return 0; rc=$?; echo "retry $i (exit=$rc)" >&2; sleep $((i*5)); done; return 1; }
cd debian/aish
DEB=$(ls aish_*.deb | head -1)
# Push the .deb into hertz's incoming dir via rrsync.
retry rsync -av -e 'ssh -i /root/.ssh/id_ed25519_hertz' "$DEB" \
marfritrepo@hertz.fritz.box:
# Trigger reprepro for each suite.
for suite in bookworm trixie; do
retry ssh -i /root/.ssh/id_ed25519_hertz marfritrepo@hertz.fritz.box \
"publish-deb $suite $DEB"
done
- name: wipe secrets
if: always()
run: rm -f /root/.ssh/id_ed25519_hertz
+53
@@ -0,0 +1,53 @@
# Maintainer: Markus Fritsche <mfritsche@reauktion.de>
# aish — AI-augmented conversational shell in LuaJIT.
# Source of truth: git.reauktion.de/marfrit/aish
pkgname=aish
pkgver=0.1.0
pkgrel=1
pkgdesc="AI-augmented conversational shell (LuaJIT, FFI-only)"
arch=('any')
url="https://git.reauktion.de/marfrit/aish"
license=('MIT')
depends=('luajit' 'readline' 'curl')
# The _tag back-translation handles both clean releases (no '_') and
# pre-release pkgvers (e.g. 0.1.0_rc1 → v0.1.0-rc1).
_tag="v${pkgver//_/-}"
source=("${pkgname}-${pkgver}.tar.gz::https://git.reauktion.de/marfrit/aish/archive/${_tag}.tar.gz")
sha256sums=('9ebc3939e028832e39391ae33efacb5ec9bcd99d123cbc8ca1cd6ca9a640b5b5')
package() {
cd "${pkgname}"
local libdir="${pkgdir}/usr/share/lua/5.1/aish"
# Top-level modules
install -Dm644 main.lua "${libdir}/main.lua"
install -Dm644 broker.lua "${libdir}/broker.lua"
install -Dm644 context.lua "${libdir}/context.lua"
install -Dm644 executor.lua "${libdir}/executor.lua"
install -Dm644 history.lua "${libdir}/history.lua"
install -Dm644 mcp.lua "${libdir}/mcp.lua"
install -Dm644 renderer.lua "${libdir}/renderer.lua"
install -Dm644 repl.lua "${libdir}/repl.lua"
install -Dm644 router.lua "${libdir}/router.lua"
install -Dm644 safety.lua "${libdir}/safety.lua"
install -Dm644 secrets.lua "${libdir}/secrets.lua"
# FFI bindings
install -Dm644 ffi/curl.lua "${libdir}/ffi/curl.lua"
install -Dm644 ffi/libc.lua "${libdir}/ffi/libc.lua"
install -Dm644 ffi/pty.lua "${libdir}/ffi/pty.lua"
install -Dm644 ffi/readline.lua "${libdir}/ffi/readline.lua"
# Vendored dependencies
install -Dm644 vendor/dkjson.lua "${libdir}/vendor/dkjson.lua"
# Launch wrapper
install -Dm755 bin/aish "${pkgdir}/usr/bin/aish"
# Documentation + example config
install -Dm644 README.md "${pkgdir}/usr/share/doc/${pkgname}/README.md"
install -Dm644 LICENSE "${pkgdir}/usr/share/licenses/${pkgname}/LICENSE"
install -Dm644 examples/config.lua \
"${pkgdir}/usr/share/doc/${pkgname}/examples/config.lua"
}
+5 -6
@@ -8,13 +8,13 @@
# NEXT.md alongside this PKGBUILD for the full rationale and the
# validation log on PineTab2 (RK3566).
#
-# Multi-arch: builds natively on x86_64 and aarch64. The x86_64 path
-# is primarily a development / CI host; the runtime target audience is
-# aarch64. The two patches are architecture-independent.
+# Cross-compiled from x86_64 using chromium's bundled clang (upstream
+# LLVM doesn't ship clang 23+ yet; chromium's internal fork is required).
+# Runtime target is aarch64. The three patches are architecture-independent.
pkgname=chromium-fourier
-pkgver=147.0.7727.116
-pkgrel=2
+pkgver=148.0.7778.178
+pkgrel=1
epoch=1
pkgdesc='Chromium with V4L2VDA HW video decode unlocked for mainline Linux Wayland on Rockchip'
arch=('aarch64' 'x86_64')
@@ -150,7 +150,6 @@ build() {
'symbol_level=0'
'is_cfi=false'
'treat_warnings_as_errors=false'
-'enable_nacl=false'
'enable_widevine=false'
# System toolchain (clang/lld from pacman)
@@ -73,7 +73,7 @@ diff --git a/ui/ozone/common/native_pixmap_egl_binding.cc b/ui/ozone/common/nati
index 31877f4459..6855c1093e 100644
--- a/ui/ozone/common/native_pixmap_egl_binding.cc
+++ b/ui/ozone/common/native_pixmap_egl_binding.cc
-@@ -6,10 +6,13 @@
+@@ -6,9 +6,12 @@
#include <array>
@@ -82,7 +82,6 @@ index 31877f4459..6855c1093e 100644
#include "base/memory/scoped_refptr.h"
+#include "base/no_destructor.h"
#include "base/notreached.h"
- #include "base/numerics/safe_conversions.h"
+#include "base/synchronization/lock.h"
#include "ui/gfx/linux/drm_util_linux.h"
#include "ui/gl/gl_bindings.h"
+8 -5
@@ -18,12 +18,15 @@ _module=daedalus_v4l2
# Same pin as arch/daedalus-v4l2 — keep kernel module + daemon
# bit-versioned together so the chardev wire protocol stays in sync.
-# PROTO_VERSION 0 → 1 at this pin (H.264 B-frame reorder fix); must
-# install both packages atomically.
-_commit=79256dc7ef41f83873ca9c23db20f5888858e65d
+# 5d8b436 reverts PRs #7 + #8 (parking design that broke libva's
+# 1:1 contract — see daedalus-v4l2#9 + #10). Tree is
+# content-equivalent to f0d4186 plus PR #4 (cosmetic menu ctrls).
+# PROTO_VERSION drops 1 → 0; lock-step install with
+# daedalus-v4l2 0.1.0.r33.5d8b436 REQUIRED.
+_commit=872eec505eb91b561892d02a0526749348ddc121
-pkgver=0.1.0.r28.79256dc
-pkgrel=1 # reset for new upstream pin (79256dc — H.264 B-frame reorder fix)
+pkgver=0.1.0.r45.872eec5
+pkgrel=1 # reset for new upstream pin (872eec5 — PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB, closes #19); lock-step with daedalus-v4l2 0.1.0.r45.872eec5 REQUIRED
pkgdesc="V4L2 stateless decoder shim kernel module (DKMS) — Pi 5 / CM5"
arch=('any')
url="https://git.reauktion.de/reauktion/daedalus-v4l2"
+11 -10
@@ -16,18 +16,19 @@
pkgname=daedalus-v4l2
_upstreampkg=daedalus-v4l2
-# Pin the daedalus-v4l2 tip. 79256dc = "kernel + daemon: H.264 B-frame
-# display reorder fix (closes #6)" — adds the wire-protocol src_pts /
-# output_src_pts / RESP_FRAME flags split that lets H.264 streams with
-# B-frames preserve display order through libva → kernel → daemon.
-# PROTO_VERSION bumps 0 → 1; lock-step userspace + kernel rebuild
-# REQUIRED (daedalus-v4l2-dkms PKGBUILD pinned to the same commit).
-_commit=79256dc7ef41f83873ca9c23db20f5888858e65d
+# 6e6dfa1 = picks up daedalus-v4l2 PR #16 — daemon now dlopens
+# the Kwiboo fourier fork's libavcodec.so.62 / libavformat.so.62 /
+# libavutil.so.60 at /opt/fourier instead of Debian-stock soname
+# 61/61/59. First step on the daedalus-fourier substitution arc
+# (daedalus-v4l2#11). Daemon still needs daedalus-fourier at
+# build time (Arch packaging for that is a follow-up; Debian side
+# fetches inline via build-deb.sh).
+_commit=872eec505eb91b561892d02a0526749348ddc121
# 0.1.0 (pre-1.0) + commit count + short sha. Bump the .Y on each
# Phase 8.x close. pkgver() recomputes at build time.
-pkgver=0.1.0.r28.79256dc
-pkgrel=1 # reset for new upstream pin (79256dc — H.264 B-frame reorder fix)
+pkgver=0.1.0.r45.872eec5
+pkgrel=1 # reset for new upstream pin (872eec5 — PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB, closes #19); lock-step with daedalus-v4l2-dkms 0.1.0.r45.872eec5 REQUIRED
pkgdesc="Userspace daemon for the daedalus-v4l2 V4L2 stateless decoder shim (VP9/AV1/H.264 on Pi 5 / CM5)"
arch=('aarch64')
url="https://git.reauktion.de/reauktion/daedalus-v4l2"
@@ -35,7 +36,7 @@ license=('BSD-2-Clause' 'GPL-2.0-or-later')
# Daemon dlopens libavformat.so.62 / libavcodec.so.62 / libavutil.so.60
# at runtime (Option γ; see daemon/src/ffmpeg_loader.h).
# ffmpeg-v4l2-request-fourier provides those; we don't link them.
-depends=('ffmpeg' 'libdrm')
+depends=('ffmpeg-v4l2-request-fourier' 'libdrm')
# Headers from libav*-dev needed at compile time for type-safe function
# pointer signatures; pkg-config locates them.
makedepends=('cmake' 'ninja' 'pkgconf' 'git' 'ffmpeg')
@@ -0,0 +1,137 @@
From f760c0541586f43334c02611fcb4c212c08ad576 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Thu, 21 May 2026 21:40:22 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 4x4 IDCT through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.idct_add (called per 4x4 block from the intra-4x4
decode path in h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
The recipe layer picks the substrate; for cycle 6 (H.264 IDCT 4x4)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution with one extra dispatch call and recipe-table lookup.
Provides the first end-to-end exercise of the daedalus-fourier
kernel pack inside the libavcodec.so decode hot path; follow-up
patches wire IDCT 8x8, luma-v deblock, and qpel mc20.
The library context is process-global, lazily initialised under
pthread_once on first call. We pick the no-QPU constructor because
libavcodec.so is loaded into arbitrary host processes
(firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...) and we
cannot assume the host has a usable Vulkan instance. Higher cycles
(deblock luma-v, MC) that benefit from the QPU will provision their
own recipe-selected context once that path is wired.
Bulk paths (idct_add16, idct_add16intra, idct_add8 — used for
non-intra4x4 macroblocks) remain on the stock NEON .S implementations
and will be batched through daedalus_recipe_dispatch_h264_idct4 with
n_blocks>1 in a follow-up.
Bit-exact against ff_h264_idct_add_neon (daedalus-fourier cycle 6
green; see marfrit/daedalus-fourier/CYCLE_LOGS.md).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2.
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_idct_daedalus.c | 49 +++++++++++++++++++++++
libavcodec/aarch64/h264dsp_init_aarch64.c | 3 +-
3 files changed, 53 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_idct_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 41ab025..7b95fb1 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -3,7 +3,8 @@ OBJS-$(CONFIG_AC3DSP) += aarch64/ac3dsp_init_aarch64.o
OBJS-$(CONFIG_FDCTDSP) += aarch64/fdctdsp_init_aarch64.o
OBJS-$(CONFIG_FMTCONVERT) += aarch64/fmtconvert_init.o
OBJS-$(CONFIG_H264CHROMA) += aarch64/h264chroma_init_aarch64.o
-OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o
+OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
+ aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
new file mode 100644
index 0000000..538d223
--- /dev/null
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -0,0 +1,49 @@
+/*
+ * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ *
+ * Routes H264DSPContext.idct_add through
+ * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
+ * The recipe layer picks the substrate (CPU NEON by default for
+ * cycle 6; future cycles may dispatch to V3D opportunistically).
+ *
+ * FFmpeg's 4x4 block memory layout matches daedalus's column-major
+ * convention: block[r + 4*c] = coefficient at (row r, col c). Both
+ * sides destructively zero the block after the transform.
+ *
+ * The library context is process-global and lazily initialised under
+ * pthread_once. We pick the no-QPU constructor here because
+ * libavcodec.so is loaded into arbitrary host processes
+ * (firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...) and we
+ * cannot assume the host has a usable Vulkan instance. Higher cycles
+ * (deblock, MC) that benefit from the QPU initialise their own
+ * recipe-selected context once that path is wired.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+#include "libavcodec/h264dsp.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+ static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
+ block, 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index c684574..b993df2 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -66,6 +66,7 @@ void ff_biweight_h264_pixels_4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride
int weights, int offset);
void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_add16_neon(uint8_t *dst, const int *block_offset,
int16_t *block, int stride,
@@ -139,7 +140,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
c->biweight_pixels_tab[1] = ff_biweight_h264_pixels_8_neon;
c->biweight_pixels_tab[2] = ff_biweight_h264_pixels_4_neon;
- c->idct_add = ff_h264_idct_add_neon;
+ c->idct_add = ff_h264_idct_add_daedalus;
c->idct_dc_add = ff_h264_idct_dc_add_neon;
c->idct_add16 = ff_h264_idct_add16_neon;
c->idct_add16intra = ff_h264_idct_add16intra_neon;
--
2.47.3
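The lazy, process-global context pattern described in the patch above can be sketched in isolation. This is a minimal illustration only: `demo_ctx`, `demo_ctx_get` and `demo_init_calls` are hypothetical stand-ins, not daedalus-fourier symbols, and the NULL handling is an assumption (the patch does not say whether the real constructor can fail).

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the library context; NOT the daedalus type. */
typedef struct demo_ctx { int ready; } demo_ctx;

static demo_ctx *g_ctx;
static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static int g_init_calls; /* counts how often the initialiser ran */

/* Runs at most once per process, no matter how many threads race here. */
static void demo_ctx_init_once(void)
{
    g_ctx = calloc(1, sizeof(*g_ctx));
    if (g_ctx)
        g_ctx->ready = 1;
    g_init_calls++;
}

/* Every hot-path call funnels through this accessor. */
demo_ctx *demo_ctx_get(void)
{
    pthread_once(&g_once, demo_ctx_init_once);
    return g_ctx; /* may be NULL if allocation failed (assumption) */
}

int demo_init_calls(void)
{
    return g_init_calls;
}
```

Calling `demo_ctx_get()` from any number of decode threads returns the same pointer, and the initialiser runs once; the per-call cost after the first call is the `pthread_once` check.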
@@ -0,0 +1,107 @@
From 1b286ddb4efaca26ec9b9e290e989fec77dc1c77 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 10:18:21 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 8x8 IDCT through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.idct8_add (called per 8x8 block from the High-profile
intra-8x8-DCT decode path in h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct8 instead of ff_h264_idct8_add_neon.
The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8x8)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution layered on top of the cycle-6 IDCT 4x4 wiring. Same
pthread_once global context, same destructive-zero semantics; FFmpeg
column-major 8x8 storage block[r + 8*c] matches daedalus's convention.
Bulk path c->idct8_add4 (used for inter 8x8-DCT macroblocks) remains
on the in-tree NEON .S code and will be batched through
daedalus_recipe_dispatch_h264_idct8 with n_blocks>1 in a follow-up.
Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 7.
---
libavcodec/aarch64/h264_idct_daedalus.c | 29 ++++++++++++++++-------
libavcodec/aarch64/h264dsp_init_aarch64.c | 3 ++-
2 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
index 538d223..cbb98af 100644
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,14 +1,16 @@
/*
- * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
*
- * Routes H264DSPContext.idct_add through
- * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
- * The recipe layer picks the substrate (CPU NEON by default for
- * cycle 6; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
+ * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * instead of the in-tree ff_h264_idct{,8}_add_neon assembly. The
+ * recipe layer picks the substrate (CPU NEON by default for cycles
+ * 6 + 7; future cycles may dispatch to V3D opportunistically).
*
- * FFmpeg's 4x4 block memory layout matches daedalus's column-major
- * convention: block[r + 4*c] = coefficient at (row r, col c). Both
- * sides destructively zero the block after the transform.
+ * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+ * column-major convention: block[r + N*c] = coefficient at
+ * (row r, col c) for N ∈ {4, 8}. Both sides destructively zero the
+ * block after the transform.
*
* The library context is process-global and lazily initialised under
* pthread_once. We pick the no-QPU constructor here because
@@ -37,6 +39,7 @@ static void daedalus_ctx_init_once(void)
}
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -47,3 +50,13 @@ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
block, 1, &meta);
}
+
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+ static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+ block, 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index b993df2..741e551 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -79,6 +79,7 @@ void ff_h264_idct_add8_neon(uint8_t **dest, const int *block_offset,
const uint8_t nnzc[15 * 8]);
void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_add4_neon(uint8_t *dst, const int *block_offset,
int16_t *block, int stride,
@@ -146,7 +147,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
c->idct_add16intra = ff_h264_idct_add16intra_neon;
if (chroma_format_idc <= 1)
c->idct_add8 = ff_h264_idct_add8_neon;
- c->idct8_add = ff_h264_idct8_add_neon;
+ c->idct8_add = ff_h264_idct8_add_daedalus;
c->idct8_dc_add = ff_h264_idct8_dc_add_neon;
c->idct8_add4 = ff_h264_idct8_add4_neon;
} else if (have_neon(cpu_flags) && bit_depth == 10) {
--
2.47.3
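The column-major indexing and destructive-zero convention mentioned in the patch above can be made concrete with a small sketch. The helper names `demo_coeff_at` and `demo_zero_block` are hypothetical; only the `block[r + N*c]` formula comes from the patch comment.

```c
#include <stdint.h>
#include <string.h>

/* Coefficient at (row r, col c) of an n x n block stored column-major,
 * i.e. block[r + n*c], as stated in the shim's header comment. */
static inline int16_t demo_coeff_at(const int16_t *block, int n, int r, int c)
{
    return block[r + n * c];
}

/* Both sides are described as zeroing the block after the transform;
 * this is the equivalent clear for an n x n block. */
static inline void demo_zero_block(int16_t *block, int n)
{
    memset(block, 0, sizeof(int16_t) * (size_t)n * (size_t)n);
}
```

With a block filled so that `block[i] == i`, the coefficient at row 3, column 2 of an 8x8 block is element 19, and the same buffer read as 4x4 puts row 1, column 3 at element 13.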
@@ -0,0 +1,121 @@
From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 12:15:16 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
deblock, called per macroblock-row edge from the slice deblock
loop) now dispatches through
daedalus_recipe_dispatch_h264_deblock_luma_v instead of
ff_h264_v_loop_filter_luma_neon.
The recipe layer picks the substrate; for cycle 8 the daedalus
docstring marks the kernel "CPU primary; QPU opportunistic", but
the libavcodec.so context here is built with
daedalus_ctx_create_no_qpu — process-global pthread_once init,
shared with cycles 6/7. QPU opportunism stays gated off until a
follow-up adds an explicit feature flag (no implicit Vulkan init
in arbitrary host processes). In the meantime cycle 8 is a
plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
only covers the non-intra path per its docstring.
FFmpeg's `int alpha`, `int beta` and `int8_t tc0[4]` arguments map onto
daedalus_h264_deblock_meta (int32_t alpha/beta + inline int8_t tc0[4]). pix already points
to row 0 of the bottom block per FFmpeg's deblock convention,
satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
---
libavcodec/aarch64/h264_idct_daedalus.c | 36 +++++++++++++++++++----
libavcodec/aarch64/h264dsp_init_aarch64.c | 4 ++-
2 files changed, 33 insertions(+), 7 deletions(-)
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
index cbb98af..92365fa 100644
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,11 +1,14 @@
/*
- * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
*
- * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
- * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
- * instead of the in-tree ff_h264_idct{,8}_add_neon assembly. The
- * recipe layer picks the substrate (CPU NEON by default for cycles
- * 6 + 7; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
+ * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * instead of the in-tree ff_h264_*_neon assembly. The recipe layer
+ * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
+ * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
+ * so cycle 8 stays on the CPU NEON path until a separate change
+ * gates QPU init on a daedalus-fourier feature flag).
*
* FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
* column-major convention: block[r + N*c] = coefficient at
@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
block, 1, &meta);
}
+
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index 741e551..85ac381 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -27,6 +27,8 @@
void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
int cpu_flags = av_get_cpu_flags();
if (have_neon(cpu_flags) && bit_depth == 8) {
- c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_neon;
+ c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_daedalus;
c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_neon;
c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
--
2.47.3
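The argument-to-meta mapping in the deblock shim above can be sketched as a small packing helper. `demo_deblock_meta` is a hypothetical struct that mirrors only the fields the commit message names (dst_off, alpha, beta, inline tc0[4]); the real daedalus_h264_deblock_meta layout lives in daedalus.h and is not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the fields named in the commit message. */
typedef struct demo_deblock_meta {
    size_t  dst_off;
    int32_t alpha;
    int32_t beta;
    int8_t  tc0[4];
} demo_deblock_meta;

/* Pack FFmpeg-style loop-filter arguments into one per-edge record.
 * memcpy replaces the four element-wise tc0 assignments in the shim. */
static demo_deblock_meta demo_pack_meta(int alpha, int beta, const int8_t *tc0)
{
    demo_deblock_meta m = { .dst_off = 0, .alpha = alpha, .beta = beta };
    memcpy(m.tc0, tc0, sizeof(m.tc0));
    return m;
}
```

Copying tc0 by value, as the shim does, means the record stays valid even if the caller's tc0 array goes out of scope before a batched dispatch runs; whether the real dispatch ever defers work is not stated in the patch.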
@@ -0,0 +1,82 @@
From 0d1292ea99bc4e5fa2da438259fa01a2374e3e04 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 14:18:25 +0200
Subject: [PATCH] avcodec/h264: restore AV_CODEC_FLAG_LOW_DELAY semantics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
FFmpeg 8.x dropped the H.264 decoder's low_delay path —
AV_CODEC_FLAG_LOW_DELAY no longer prevents
h264_select_output_frame from running the display-order DPB
output queue. V4L2-stateless-style consumers (daedalus-v4l2
daemon, libva-v4l2-request-fourier) that set the flag end up
seeing the 2-1-4-3 pair-swap pattern on B-frame streams again.
Restore the documented semantics:
- Early-exit at the top of h264_select_output_frame when the
flag is set: emit the just-decoded picture immediately as
next_output_pic, mirror the corruption / recovery-point
tracking the main path performs, and skip the entire
delayed_pic[] / POC reorder machinery.
- Suppress the SPS-driven has_b_frames clobber in
h264_field_start when the flag is set, so the per-slice
bitstream_restriction_flag re-pickup cannot reintroduce a
nonzero reorder buffer mid-stream.
This is a fork-only change required by the daedalus-v4l2 daemon's
one-frame-per-send_packet contract; upstream FFmpeg consumers that
expect display-order output remain untouched (flag default = off).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 deblock
+ flag-restoration follow-up.
---
libavcodec/h264_slice.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/libavcodec/h264_slice.c b/libavcodec/h264_slice.c
index 97fab70..a7bfbd6 100644
--- a/libavcodec/h264_slice.c
+++ b/libavcodec/h264_slice.c
@@ -1308,6 +1308,28 @@ static int h264_select_output_frame(H264Context *h)
cur->mmco_reset = h->mmco_reset;
h->mmco_reset = 0;
+ /* AV_CODEC_FLAG_LOW_DELAY restore (FFmpeg 8.x dropped the H.264
+ * decoder's low_delay path). Bypass the display-order DPB
+ * output queue: emit the just-decoded picture immediately, in
+ * decode order, one per send_packet. V4L2-stateless-style
+ * consumers (daedalus-v4l2 daemon, libva-v4l2-request-fourier)
+ * do their own POC-based reorder downstream and require this
+ * behaviour. */
+ if (h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) {
+ h->next_output_pic = cur;
+ h->next_outputed_poc = cur->poc;
+ h->frame_recovered |= cur->recovered;
+ cur->recovered |= h->frame_recovered & FRAME_RECOVERED_SEI;
+ if (!cur->recovered) {
+ if (!(h->avctx->flags & AV_CODEC_FLAG_OUTPUT_CORRUPT) &&
+ !(h->avctx->flags2 & AV_CODEC_FLAG2_SHOW_ALL))
+ h->next_output_pic = NULL;
+ else
+ cur->f->flags |= AV_FRAME_FLAG_CORRUPT;
+ }
+ return 0;
+ }
+
if (sps->bitstream_restriction_flag ||
h->avctx->strict_std_compliance >= FF_COMPLIANCE_STRICT) {
h->avctx->has_b_frames = FFMAX(h->avctx->has_b_frames, sps->num_reorder_frames);
@@ -1415,6 +1437,7 @@ static int h264_field_start(H264Context *h, const H264SliceContext *sl,
sps = h->ps.sps;
if (sps->bitstream_restriction_flag &&
+ !(h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) &&
h->avctx->has_b_frames < sps->num_reorder_frames) {
h->avctx->has_b_frames = sps->num_reorder_frames;
}
--
2.47.3
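The output decision taken by the low-delay early exit above reduces to a small truth table. The sketch below restates it as a pure function; `demo_output_decision` and `demo_low_delay_decide` are hypothetical names, and the three booleans stand in for `cur->recovered`, AV_CODEC_FLAG_OUTPUT_CORRUPT and AV_CODEC_FLAG2_SHOW_ALL.

```c
#include <stdbool.h>

typedef struct demo_output_decision {
    bool emit;         /* picture becomes next_output_pic */
    bool mark_corrupt; /* picture is flagged AV_FRAME_FLAG_CORRUPT */
} demo_output_decision;

/* Mirrors the branch structure of the low-delay path in the patch:
 * a recovered picture is always emitted; an unrecovered one is either
 * dropped or emitted with the corrupt flag, depending on the flags. */
static demo_output_decision demo_low_delay_decide(bool recovered,
                                                  bool output_corrupt,
                                                  bool show_all)
{
    demo_output_decision d = { .emit = true, .mark_corrupt = false };

    if (!recovered) {
        if (!output_corrupt && !show_all)
            d.emit = false;
        else
            d.mark_corrupt = true;
    }
    return d;
}
```

The practical consequence for a one-frame-per-packet consumer is that an unrecovered picture yields no output unless one of the two flags is set, so such a consumer has to tolerate a packet that produces no frame.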
@@ -0,0 +1,139 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Sat, 23 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" variant — the canonical representative of the
H.264 luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.
Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
8 luma-v deblock / 9 qpel mc20).
The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
makes any V3D shader strictly worse. Substitution is plumbing-only,
NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
context shape the cycles 6/7/8 shims already own (kept SEPARATE
from the H264DSP shim's ctx because H264QPEL is its own libavcodec
Makefile module and link order does not guarantee a single .o
owns the ctx symbol; one extra ~µs init per process, paid lazily).
Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
rationale, mc20 8x8 is representative of the whole family's per-block
cost — extending the substitution to other variants would multiply
recipe-lookup overhead without changing the substrate verdict.
Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
No SONAME change, no Depends change.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
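The throughput margin quoted above can be checked with a few lines of arithmetic. This sketch assumes one mc20 call per 8x8 luma block and a 1920x1080 frame; both are assumptions for illustration, and the function names are hypothetical.

```c
/* 8x8 blocks needed per second to cover every luma pixel once per frame. */
static double demo_blocks_per_second(int width, int height, int fps)
{
    return (double)(width / 8) * (double)(height / 8) * (double)fps;
}

/* Ratio of measured kernel throughput to the required block rate. */
static double demo_margin(double ns_per_block, int width, int height, int fps)
{
    double throughput = 1e9 / ns_per_block; /* blocks per second */
    return throughput / demo_blocks_per_second(width, height, fps);
}
```

At 7.6 ns per block the throughput is about 131.6 Mblock/s, and 1080p at 30 fps needs 240 x 135 x 30 = 972,000 blocks/s, which gives a ratio of roughly 135, in line with the figures in the commit message.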
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_qpel_daedalus.c | 50 ++++++++++++++++++++++
libavcodec/aarch64/h264qpel_init_aarch64.c | 4 +-
3 files changed, 55 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
-OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
+OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o \
+ aarch64/h264_qpel_daedalus.o
OBJS-$(CONFIG_HPELDSP) += aarch64/hpeldsp_init_aarch64.o
OBJS-$(CONFIG_IDCTDSP) += aarch64/idctdsp_init_aarch64.o
OBJS-$(CONFIG_ME_CMP) += aarch64/me_cmp_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
new file mode 100644
--- /dev/null
+++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
+/*
+ * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
+ * — daedalus-fourier substitution shim.
+ *
+ * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
+ * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
+ * ff_put_h264_qpel8_mc20_neon. The recipe layer picks the substrate
+ * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
+ * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
+ *
+ * Sibling to libavcodec/aarch64/h264_idct_daedalus.c. We keep a
+ * SEPARATE process-global pthread_once context here instead of
+ * sharing the H264DSP one because H264QPEL is its own libavcodec
+ * Makefile module and link order does not guarantee a single .o
+ * owns the ctx symbol. The cost is one extra
+ * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
+ * processes pay this lazily on first MC call.
+ *
+ * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
+ * stride and `src` already points at the leftmost OUTPUT column
+ * (col 0); the 6-tap filter reads cols -2..+3. This matches
+ * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
+ * directly, so dst_off = src_off = 0.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+ static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
--- a/libavcodec/aarch64/h264qpel_init_aarch64.c
+++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
- c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
+ c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
--
2.47.3
@@ -0,0 +1,92 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Mon, 25 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-h deblock through daedalus-fourier
Sibling of 0005 (which substituted v_loop_filter_luma). Same
NEON-to-NEON substitution: H264DSPContext.h_loop_filter_luma →
daedalus_recipe_dispatch_h264_deblock_luma_h. The H kernel landed
in daedalus-fourier PR #9 (CPU NEON only — no QPU shader yet).
libavcodec.so ctx is no-QPU per the existing 0003-0005 / 0007
pattern; we cannot assume Vulkan in arbitrary host processes
(firefox-fourier RDD, mpv-fourier, etc.).
Intra (bS=4) h_loop_filter_luma_intra stays on the in-tree NEON .S
code; daedalus_h264_deblock_meta only covers the non-intra path.
An intra-h substitution can land once daedalus-fourier exposes a
dispatch helper (the kernel already exists internally per PR #11).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 H.
---
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:09:33.694760715 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:09:33.715603719 +0200
@@ -1,9 +1,10 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h deblock — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
* H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
* is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -45,6 +46,8 @@
void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t *tc0);
+void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -84,3 +87,22 @@
daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
1, &meta);
}
+
+void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_h(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:09:33.695937103 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:09:33.715541700 +0200
@@ -31,6 +31,8 @@
int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta);
void ff_h264_h_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -117,7 +119,7 @@
if (have_neon(cpu_flags) && bit_depth == 8) {
c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_daedalus;
- c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_neon;
+ c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_daedalus;
c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
--
2.47.3
@@ -0,0 +1,127 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Mon, 25 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma v/h deblock through daedalus-fourier
Chroma siblings of 0005 (luma_v) and 0008 (luma_h). Same
NEON-to-NEON pattern via the daedalus recipe layer:
H264DSPContext.v_loop_filter_chroma →
daedalus_recipe_dispatch_h264_deblock_chroma_v
H264DSPContext.h_loop_filter_chroma →
daedalus_recipe_dispatch_h264_deblock_chroma_h
Both kernels landed in daedalus-fourier PR #10. Recipe table
routes AUTO to CPU NEON (no chroma QPU shaders yet), so this
is plumbing-only and stays bit-exact against the in-tree NEON.
Intra chroma (bS=4) loop filters remain on in-tree NEON;
daedalus_h264_deblock_meta covers the non-intra (bS<4) path.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 chroma.
---
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:15:45.995368233 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:15:46.015839177 +0200
@@ -1,10 +1,12 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma v/h deblock — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h + chroma v/h deblock — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
- * H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
- * H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
+ * H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
+ * H264DSPContext.v_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_v
+ * H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
* is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -48,6 +50,10 @@
int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
+void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -106,3 +112,41 @@
daedalus_recipe_dispatch_h264_deblock_luma_h(g_dctx, pix, (size_t)stride,
1, &meta);
}
+
+void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_chroma_v(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
+
+void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_chroma_h(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:15:45.996482360 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:15:46.025604910 +0200
@@ -39,8 +39,12 @@
int beta);
void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_chroma422_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_chroma_intra_neon(uint8_t *pix, ptrdiff_t stride,
@@ -123,11 +127,11 @@
c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
- c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_neon;
+ c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
if (chroma_format_idc <= 1) {
- c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_neon;
+ c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
} else {
--
2.47.3
@@ -0,0 +1,126 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Mon, 25 May 2026 12:30:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma intra deblock through daedalus-fourier
Adds the bS=4 intra-strength variants of the already-substituted
luma_v / luma_h deblock (0005, 0008). Per H.264 §8.7.2.1, boundary
strength 4 applies to macroblock edges in frame pictures where
either neighbouring macroblock is intra-coded; edges interior to an
intra macroblock get bS=3 and stay on the bS<4 filters.
H264DSPContext.v_loop_filter_luma_intra →
daedalus_recipe_dispatch_h264_deblock_luma_v_intra
H264DSPContext.h_loop_filter_luma_intra →
daedalus_recipe_dispatch_h264_deblock_luma_h_intra
Both kernels landed in daedalus-fourier PR #11. Recipe table
routes AUTO to CPU NEON (no intra QPU shaders yet) — plumbing-
only NEON-to-NEON via daedalus, bit-exact against the in-tree
FFmpeg NEON path.
Signature differs from bS<4: no tc0 argument. The wrapper
passes daedalus_h264_deblock_meta with alpha/beta set; tc0[] is
ignored by the intra dispatch (bS=4 hardcodes the strength).
Chroma intra variants are deferred to a follow-up PR because the
chroma path has a 4:2:0 / 4:2:2 split (chroma_format_idc gating)
that needs explicit conditional substitution to avoid running
the 4:2:0-only daedalus dispatch on 4:2:2 chroma.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 intra.
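The difference between the bS<4 and bS=4 wrapper shapes can be shown
with a stand-in struct. demo_deblock_meta below is hypothetical: it
mirrors only the field names the wrappers set (dst_off, alpha, beta,
tc0[4]); the real daedalus_h264_deblock_meta layout and field types
are not visible in this patch and are assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the fields the wrappers populate. */
typedef struct demo_deblock_meta {
    size_t dst_off;
    int    alpha;
    int    beta;
    int8_t tc0[4];   /* assumed element type */
} demo_deblock_meta;

/* bS<4 shape: the caller supplies four tc0 clipping values. */
static demo_deblock_meta demo_meta_inter(int alpha, int beta,
                                         const int8_t tc0[4])
{
    demo_deblock_meta m = { .dst_off = 0, .alpha = alpha, .beta = beta };
    assert(tc0 != NULL);
    memcpy(m.tc0, tc0, sizeof m.tc0);
    return m;
}

/* bS=4 shape: no tc0 argument. Members not named in a designated
 * initializer are zero-initialized, so tc0[] is all zeros. */
static demo_deblock_meta demo_meta_intra(int alpha, int beta)
{
    demo_deblock_meta m = { .dst_off = 0, .alpha = alpha, .beta = beta };
    return m;
}
```

The second helper relies on a C language guarantee, so the intra
wrappers hand the dispatch a fully defined struct even though they
never touch tc0.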
---
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:18:54.992244965 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:20:12.338122217 +0200
@@ -1,5 +1,5 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma v/h + chroma v/h deblock — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h (inter + intra) + chroma v/h deblock — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
@@ -7,6 +7,8 @@
* H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
* H264DSPContext.v_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_v
* H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
+ * H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
+ * H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
* is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -54,6 +56,10 @@
int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
+void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -150,3 +156,34 @@
daedalus_recipe_dispatch_h264_deblock_chroma_h(g_dctx, pix, (size_t)stride,
1, &meta);
}
+
+void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ /* tc0[] is ignored by the intra-strength dispatch (bS=4 hardcodes the strength). */
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_v_intra(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
+
+void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_h_intra(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:18:54.993349573 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:20:12.338265830 +0200
@@ -35,8 +35,12 @@
int alpha, int beta, int8_t *tc0);
void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta);
+void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_h_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta);
+void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -124,8 +128,8 @@
if (have_neon(cpu_flags) && bit_depth == 8) {
c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_daedalus;
c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_daedalus;
- c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
- c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
+ c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_daedalus;
+ c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_daedalus;
c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
--
2.47.3
@@ -0,0 +1,101 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Mon, 25 May 2026 13:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma DC Hadamard through daedalus-fourier
Substitutes H264DSPContext.chroma_dc_dequant_idct in the
4:2:0 / bit_depth=8 init path with a wrapper that composes
the daedalus chroma DC Hadamard primitive (fourier PR #25)
with the qmul scaling that FFmpeg performs in one fused function.
Bit-exact against ff_h264_chroma_dc_dequant_idct_8_c.
Hadamard correctness is gated by the fourier PR #23 test suite.
4:2:2 chroma stays on the in-tree 422 variant (same
gating shape as 0009 chroma deblock substitution).
Requires daedalus-fourier commit b9f9ff2 or later (PR #25
exposing the public Hadamard symbol). Pin bumps in PKGBUILD
and build-deb.sh come in the same commit.
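The composition described above (gather four scattered DC values,
run a 2x2 Hadamard, scale by qmul, scatter back) can be sketched
without daedalus. demo_hadamard_2x2 is a hypothetical stand-in for
the library primitive, whose exact signature is not shown outside
this patch. Keeping the intermediates in int is a choice made for
the sketch. The stride constants are the ones the wrapper uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DC_STRIDE = 32, DC_XSTRIDE = 16 };

/* Hypothetical stand-in: in-place 2x2 Hadamard on row-major values
 * { a, b, c, d } laid out as [[a, b], [c, d]]. */
static void demo_hadamard_2x2(int dc[4])
{
    int a = dc[0], b = dc[1], c = dc[2], d = dc[3];
    dc[0] = a + b + c + d;
    dc[1] = a - b + c - d;
    dc[2] = a + b - c - d;
    dc[3] = a - b - c + d;
}

/* Gather, transform, dequantize (>> 7 after the qmul multiply), and
 * write back to the original scattered positions. */
static void demo_chroma_dc_dequant(int16_t *block, int qmul)
{
    static const int pos[4] = {
        0, DC_XSTRIDE, DC_STRIDE, DC_STRIDE + DC_XSTRIDE
    };
    int dc[4];

    assert(block != NULL);
    for (int i = 0; i < 4; i++)
        dc[i] = block[pos[i]];

    demo_hadamard_2x2(dc);

    for (int i = 0; i < 4; i++)
        block[pos[i]] = (int16_t)((dc[i] * qmul) >> 7);
}
```

With qmul = 128 the shift cancels the multiply, so inputs
{10, 4, 6, 2} at the four DC positions come back as the plain
Hadamard outputs {22, 10, 6, 2}. The block buffer must hold at least
DC_STRIDE + DC_XSTRIDE + 1 coefficients.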
---
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:38:32.019491484 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:38:32.033821507 +0200
@@ -1,5 +1,5 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma v/h (inter + intra) + chroma v/h deblock — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
@@ -9,6 +9,7 @@
* H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
* H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
* H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
+ * H264DSPContext.chroma_dc_dequant_idct → daedalus_h264_chroma_dc_hadamard_2x2 + caller-side qmul
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
* is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -60,6 +61,7 @@
int alpha, int beta);
void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
+void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -187,3 +189,32 @@
daedalus_recipe_dispatch_h264_deblock_luma_h_intra(g_dctx, pix, (size_t)stride,
1, &meta);
}
+
+/* Composes daedalus_h264_chroma_dc_hadamard_2x2 with the qmul scaling
+ * that FFmpeg's reference does in one fused function (h264idct_template.c
+ * ff_h264_chroma_dc_dequant_idct).
+ *
+ * The 4 DC coefficients are scattered across the per-MB coefficient
+ * buffer at offsets [r*stride + c*xStride] (stride=32, xStride=16).
+ * Extract into a contiguous int16[4], run the Hadamard, then apply
+ * the qmul scale and write back to the original positions.
+ *
+ * No daedalus ctx needed; the Hadamard is a pure stateless primitive.
+ */
+void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul)
+{
+ enum { stride = 32, xStride = 16 };
+ int16_t dc[4];
+
+ dc[0] = block[stride*0 + xStride*0];
+ dc[1] = block[stride*0 + xStride*1];
+ dc[2] = block[stride*1 + xStride*0];
+ dc[3] = block[stride*1 + xStride*1];
+
+ daedalus_h264_chroma_dc_hadamard_2x2(dc);
+
+ block[stride*0 + xStride*0] = (int16_t)((int)dc[0] * qmul >> 7);
+ block[stride*0 + xStride*1] = (int16_t)((int)dc[1] * qmul >> 7);
+ block[stride*1 + xStride*0] = (int16_t)((int)dc[2] * qmul >> 7);
+ block[stride*1 + xStride*1] = (int16_t)((int)dc[3] * qmul >> 7);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:38:32.020346459 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:38:32.033909804 +0200
@@ -41,6 +41,7 @@
int beta);
void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
+void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -135,6 +136,7 @@
c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
if (chroma_format_idc <= 1) {
+ c->chroma_dc_dequant_idct = ff_h264_chroma_dc_dequant_idct_daedalus;
c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
--
2.47.3
@@ -0,0 +1,245 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Mon, 25 May 2026 14:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264qpel: route remaining qpel 8x8 positions through daedalus-fourier
Closes the H.264 qpel substitution. Extends 0007 (which routed only
mc20 put_) to ALL 15 useful positions in BOTH the put_ and avg_
tables, skipping mc00 (integer copy / pointer-only fast path).
29 substitutions total: 14 new put_ + 15 avg_. Each is a uniform
wrapper around daedalus_recipe_dispatch_h264_qpel_{avg_,}mcXY exposed
by daedalus-fourier PRs #15-#20.
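The position count can be checked against the table layout. In the
init code touched by this patch, mcXY lands at table index X + 4*Y
(mc10 at 1, mc01 at 4, mc33 at 15), with mc00 at index 0. The helper
names below are illustrative only.

```c
#include <assert.h>

/* Table slot for quarter-pel offset (x, y), each in 0..3. */
static int qpel_tab_index(int x, int y)
{
    assert(x >= 0 && x < 4 && y >= 0 && y < 4);
    return x + 4 * y;
}

/* Number of positions with a fractional offset, i.e. all but mc00. */
static int qpel_fractional_positions(void)
{
    int n = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            if (qpel_tab_index(x, y) != 0)
                n++;
    return n;
}
```

qpel_fractional_positions() returns 15. The put_ table already had
mc20 routed by 0007, leaving 14 new put_ entries, and the avg_ table
takes all 15, which gives the 29 substitutions stated above.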
All recipe-table entries route AUTO to CPU NEON (no QPU shaders
for any qpel position other than mc20 yet), so this is plumbing-only
NEON-to-NEON — bit-exact against the in-tree ff_*_h264_qpel8_*_neon
path.
16x16 qpel tables ([0][...]) stay on the in-tree NEON. daedalus
only exposes 8x8 today; 16x16 substitution can land once fourier
provides those variants (likely just dispatching the 8x8 path four
times with shifted dst/src offsets).
Refs reauktion/daedalus-v4l2#11 — substitution arc qpel buildout.
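The 16x16-from-8x8 idea mentioned above could look roughly like the
following. This is speculative, matching the "likely" in the message:
demo_qpel8_fn and demo_qpel16_from_qpel8 are hypothetical names, and
a plain copy kernel stands in for a real 8x8 qpel function so the
sketch can be checked on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the 8x8 qpel entries: one stride for dst and src. */
typedef void (*demo_qpel8_fn)(uint8_t *dst, const uint8_t *src,
                              ptrdiff_t stride);

/* Placeholder 8x8 kernel: integer-position copy, no filtering. */
static void demo_copy8(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = src[y * stride + x];
}

/* Cover a 16x16 block with four 8x8 calls at shifted offsets:
 * top-left, top-right, bottom-left, bottom-right. */
static void demo_qpel16_from_qpel8(demo_qpel8_fn f8, uint8_t *dst,
                                   const uint8_t *src, ptrdiff_t stride)
{
    assert(f8 != NULL);
    f8(dst,                  src,                  stride);
    f8(dst + 8,              src + 8,              stride);
    f8(dst + 8 * stride,     src + 8 * stride,     stride);
    f8(dst + 8 * stride + 8, src + 8 * stride + 8, stride);
}
```

Each quadrant writes a disjoint 8x8 region of dst, so the four calls
are independent. A real 6-tap kernel reads source samples outside its
own 8x8 block, which is fine here because those are reads only.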
---
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
--- a/libavcodec/aarch64/h264_qpel_daedalus.c 2026-05-25 14:05:05.789298250 +0200
+++ libavcodec/aarch64/h264_qpel_daedalus.c 2026-05-25 14:05:05.818358374 +0200
@@ -1,10 +1,13 @@
/*
- * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
- * — daedalus-fourier substitution shim.
+ * H.264 luma qpel 8x8 — daedalus-fourier substitution shims (put_ + avg_).
*
- * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
- * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
- * ff_put_h264_qpel8_mc20_neon. The recipe layer picks the substrate
+ * Routes ALL 15 useful positions in H264QpelContext's 8x8 put_ and
+ * avg_ tables through daedalus_recipe_dispatch_h264_qpel_mc{XY}
+ * (skipping mc00 which is integer copy / FFmpeg's pointer-only fast
+ * path). Plumbing-only NEON-by-recipe — daedalus-fourier PRs #15-#20
+ * exposed each variant via the same dispatch signature, so the
+ * substitution is a uniform macro across put_/avg_ and across all
+ * 15 mc positions. The recipe layer picks the substrate
* (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
* ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
*
@@ -48,3 +51,53 @@
daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
1, &meta);
}
+
+
+/* All other 8x8 qpel positions follow the same dispatch shape as mc20
+ * above. The macro collapses ~600 LOC of one-wrapper-per-variant
+ * boilerplate (29 variants total: 14 put_ + 15 avg_). */
+#define DEFINE_QPEL_WRAPPER(type, suffix, dispatch_fn) \
+void ff_ ## type ## _h264_qpel8_ ## suffix ## _daedalus(uint8_t *dst, \
+ const uint8_t *src, ptrdiff_t stride); \
+void ff_ ## type ## _h264_qpel8_ ## suffix ## _daedalus(uint8_t *dst, \
+ const uint8_t *src, ptrdiff_t stride) \
+{ \
+ static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 }; \
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once); \
+ dispatch_fn(g_dctx, dst, src, (size_t)stride, 1, &meta); \
+}
+
+/* put_ variants (mc20 stays on the explicit definition above). */
+DEFINE_QPEL_WRAPPER(put, mc10, daedalus_recipe_dispatch_h264_qpel_mc10)
+DEFINE_QPEL_WRAPPER(put, mc30, daedalus_recipe_dispatch_h264_qpel_mc30)
+DEFINE_QPEL_WRAPPER(put, mc01, daedalus_recipe_dispatch_h264_qpel_mc01)
+DEFINE_QPEL_WRAPPER(put, mc11, daedalus_recipe_dispatch_h264_qpel_mc11)
+DEFINE_QPEL_WRAPPER(put, mc21, daedalus_recipe_dispatch_h264_qpel_mc21)
+DEFINE_QPEL_WRAPPER(put, mc31, daedalus_recipe_dispatch_h264_qpel_mc31)
+DEFINE_QPEL_WRAPPER(put, mc02, daedalus_recipe_dispatch_h264_qpel_mc02)
+DEFINE_QPEL_WRAPPER(put, mc12, daedalus_recipe_dispatch_h264_qpel_mc12)
+DEFINE_QPEL_WRAPPER(put, mc22, daedalus_recipe_dispatch_h264_qpel_mc22)
+DEFINE_QPEL_WRAPPER(put, mc32, daedalus_recipe_dispatch_h264_qpel_mc32)
+DEFINE_QPEL_WRAPPER(put, mc03, daedalus_recipe_dispatch_h264_qpel_mc03)
+DEFINE_QPEL_WRAPPER(put, mc13, daedalus_recipe_dispatch_h264_qpel_mc13)
+DEFINE_QPEL_WRAPPER(put, mc23, daedalus_recipe_dispatch_h264_qpel_mc23)
+DEFINE_QPEL_WRAPPER(put, mc33, daedalus_recipe_dispatch_h264_qpel_mc33)
+
+/* avg_ variants — all 15 useful positions. */
+DEFINE_QPEL_WRAPPER(avg, mc10, daedalus_recipe_dispatch_h264_qpel_avg_mc10)
+DEFINE_QPEL_WRAPPER(avg, mc20, daedalus_recipe_dispatch_h264_qpel_avg_mc20)
+DEFINE_QPEL_WRAPPER(avg, mc30, daedalus_recipe_dispatch_h264_qpel_avg_mc30)
+DEFINE_QPEL_WRAPPER(avg, mc01, daedalus_recipe_dispatch_h264_qpel_avg_mc01)
+DEFINE_QPEL_WRAPPER(avg, mc11, daedalus_recipe_dispatch_h264_qpel_avg_mc11)
+DEFINE_QPEL_WRAPPER(avg, mc21, daedalus_recipe_dispatch_h264_qpel_avg_mc21)
+DEFINE_QPEL_WRAPPER(avg, mc31, daedalus_recipe_dispatch_h264_qpel_avg_mc31)
+DEFINE_QPEL_WRAPPER(avg, mc02, daedalus_recipe_dispatch_h264_qpel_avg_mc02)
+DEFINE_QPEL_WRAPPER(avg, mc12, daedalus_recipe_dispatch_h264_qpel_avg_mc12)
+DEFINE_QPEL_WRAPPER(avg, mc22, daedalus_recipe_dispatch_h264_qpel_avg_mc22)
+DEFINE_QPEL_WRAPPER(avg, mc32, daedalus_recipe_dispatch_h264_qpel_avg_mc32)
+DEFINE_QPEL_WRAPPER(avg, mc03, daedalus_recipe_dispatch_h264_qpel_avg_mc03)
+DEFINE_QPEL_WRAPPER(avg, mc13, daedalus_recipe_dispatch_h264_qpel_avg_mc13)
+DEFINE_QPEL_WRAPPER(avg, mc23, daedalus_recipe_dispatch_h264_qpel_avg_mc23)
+DEFINE_QPEL_WRAPPER(avg, mc33, daedalus_recipe_dispatch_h264_qpel_avg_mc33)
+
+#undef DEFINE_QPEL_WRAPPER
diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
--- a/libavcodec/aarch64/h264qpel_init_aarch64.c 2026-05-25 14:05:05.790403989 +0200
+++ libavcodec/aarch64/h264qpel_init_aarch64.c 2026-05-25 14:05:05.819136071 +0200
@@ -50,6 +50,64 @@
void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
ptrdiff_t stride);
+void ff_put_h264_qpel8_mc10_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc30_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc01_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc11_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc21_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc31_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc02_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc12_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc22_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc32_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc03_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc13_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc23_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc33_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc10_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc30_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc01_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc11_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc21_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc31_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc02_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc12_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc22_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc32_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc03_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc13_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc23_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc33_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -164,21 +222,21 @@
c->put_h264_qpel_pixels_tab[0][15] = ff_put_h264_qpel16_mc33_neon;
c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
- c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
+ c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_daedalus;
c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
- c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
- c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
- c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
- c->put_h264_qpel_pixels_tab[1][ 6] = ff_put_h264_qpel8_mc21_neon;
- c->put_h264_qpel_pixels_tab[1][ 7] = ff_put_h264_qpel8_mc31_neon;
- c->put_h264_qpel_pixels_tab[1][ 8] = ff_put_h264_qpel8_mc02_neon;
- c->put_h264_qpel_pixels_tab[1][ 9] = ff_put_h264_qpel8_mc12_neon;
- c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_neon;
- c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_neon;
- c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_neon;
- c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_neon;
- c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_neon;
- c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_neon;
+ c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 6] = ff_put_h264_qpel8_mc21_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 7] = ff_put_h264_qpel8_mc31_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 8] = ff_put_h264_qpel8_mc02_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 9] = ff_put_h264_qpel8_mc12_daedalus;
+ c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_daedalus;
+ c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_daedalus;
+ c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_daedalus;
+ c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_daedalus;
+ c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_daedalus;
+ c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_daedalus;
c->avg_h264_qpel_pixels_tab[0][ 0] = ff_avg_h264_qpel16_mc00_neon;
c->avg_h264_qpel_pixels_tab[0][ 1] = ff_avg_h264_qpel16_mc10_neon;
@@ -198,21 +256,21 @@
c->avg_h264_qpel_pixels_tab[0][15] = ff_avg_h264_qpel16_mc33_neon;
c->avg_h264_qpel_pixels_tab[1][ 0] = ff_avg_h264_qpel8_mc00_neon;
- c->avg_h264_qpel_pixels_tab[1][ 1] = ff_avg_h264_qpel8_mc10_neon;
- c->avg_h264_qpel_pixels_tab[1][ 2] = ff_avg_h264_qpel8_mc20_neon;
- c->avg_h264_qpel_pixels_tab[1][ 3] = ff_avg_h264_qpel8_mc30_neon;
- c->avg_h264_qpel_pixels_tab[1][ 4] = ff_avg_h264_qpel8_mc01_neon;
- c->avg_h264_qpel_pixels_tab[1][ 5] = ff_avg_h264_qpel8_mc11_neon;
- c->avg_h264_qpel_pixels_tab[1][ 6] = ff_avg_h264_qpel8_mc21_neon;
- c->avg_h264_qpel_pixels_tab[1][ 7] = ff_avg_h264_qpel8_mc31_neon;
- c->avg_h264_qpel_pixels_tab[1][ 8] = ff_avg_h264_qpel8_mc02_neon;
- c->avg_h264_qpel_pixels_tab[1][ 9] = ff_avg_h264_qpel8_mc12_neon;
- c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_neon;
- c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_neon;
- c->avg_h264_qpel_pixels_tab[1][12] = ff_avg_h264_qpel8_mc03_neon;
- c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_neon;
- c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_neon;
- c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_neon;
+ c->avg_h264_qpel_pixels_tab[1][ 1] = ff_avg_h264_qpel8_mc10_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 2] = ff_avg_h264_qpel8_mc20_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 3] = ff_avg_h264_qpel8_mc30_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 4] = ff_avg_h264_qpel8_mc01_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 5] = ff_avg_h264_qpel8_mc11_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 6] = ff_avg_h264_qpel8_mc21_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 7] = ff_avg_h264_qpel8_mc31_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 8] = ff_avg_h264_qpel8_mc02_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 9] = ff_avg_h264_qpel8_mc12_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][12] = ff_avg_h264_qpel8_mc03_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_daedalus;
} else if (have_neon(cpu_flags) && bit_depth == 10) {
c->put_h264_qpel_pixels_tab[0][ 1] = ff_put_h264_qpel16_mc10_neon_10;
c->put_h264_qpel_pixels_tab[0][ 2] = ff_put_h264_qpel16_mc20_neon_10;
--
2.47.3
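The table assignments in the hunks above follow a regular pattern: `mcXY` sits at slot `X + 4*Y` of `*_h264_qpel_pixels_tab[1]` (mc10 at 1, mc01 at 4, mc33 at 15), and slot 0 (mc00, the full-pel copy) stays on NEON. A minimal sketch of that mapping follows; `qpel_tab_index` and `qpel_fractional_slots` are hypothetical helpers for illustration, not FFmpeg or daedalus-fourier symbols.

```c
#include <assert.h>

/* Hypothetical helper: maps a quarter-pel offset (x, y), each in 0..3,
 * to the slot that holds the mcXY function in
 * put/avg_h264_qpel_pixels_tab[size][slot]. */
static int qpel_tab_index(int x, int y)
{
    assert(x >= 0 && x < 4 && y >= 0 && y < 4);
    return x + 4 * y;
}

/* Counts the fractional positions per table, i.e. every slot except
 * slot 0 (mc00), which the hunks above leave on the NEON function. */
static int qpel_fractional_slots(void)
{
    int n = 0;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            if (qpel_tab_index(x, y) != 0)
                n++;
    return n;
}
```

Under this numbering the horizontal offset varies fastest, which is why mc10, mc20 and mc30 occupy slots 1 to 3 while mc01 starts the next group of four.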
@@ -0,0 +1,120 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Sun, 25 May 2026 14:30:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma intra deblock (4:2:0) through daedalus-fourier
Substitutes c->v_loop_filter_chroma_intra and c->h_loop_filter_chroma_intra
with daedalus wrappers in the bit_depth=8 / chroma_format_idc<=1 (4:2:0)
branch. 4:2:2 stays on the in-tree NEON path (the daedalus chroma intra
dispatch is 4:2:0-only).
The fourier dispatches were exposed in PR #11 (DEFINE_INTRA_DISPATCH
macro generates the public daedalus_dispatch_h264_deblock_chroma_*_intra
symbols + recipe wrappers).
Re-architects the chroma init: v_loop_filter_chroma_intra was previously
assigned unconditionally to the NEON variant (which works for both 4:2:0
and 4:2:2). We now assign it INSIDE both branches of the chroma_format_idc
conditional, with the 4:2:0 branch picking daedalus and the 4:2:2 branch
keeping NEON. No regression for 4:2:2 streams.
Same shape as the 0010 luma intra patch: NEON to NEON, via the recipe layer.
Refs reauktion/daedalus-v4l2#11 — substitution arc chroma intra.
---
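The re-architected chroma init described above can be modelled in isolation. This is a minimal sketch, assuming stand-in kernels that only write a tag byte; `demo_chroma_ctx`, `demo_init_chroma_intra` and the `demo_*` kernels are hypothetical names, and only the branch structure mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*demo_intra_fn)(uint8_t *pix, ptrdiff_t stride, int alpha, int beta);

/* Stand-in kernels: each writes a tag so callers can tell which one ran.
 * 'D' = daedalus wrapper, 'N' = in-tree NEON, '2' = NEON 4:2:2 variant. */
static void demo_v_intra_daedalus(uint8_t *p, ptrdiff_t s, int a, int b)
{ (void)s; (void)a; (void)b; p[0] = 'D'; }
static void demo_h_intra_daedalus(uint8_t *p, ptrdiff_t s, int a, int b)
{ (void)s; (void)a; (void)b; p[0] = 'D'; }
static void demo_v_intra_neon(uint8_t *p, ptrdiff_t s, int a, int b)
{ (void)s; (void)a; (void)b; p[0] = 'N'; }
static void demo_h_intra_422_neon(uint8_t *p, ptrdiff_t s, int a, int b)
{ (void)s; (void)a; (void)b; p[0] = '2'; }

struct demo_chroma_ctx {
    demo_intra_fn v_intra;
    demo_intra_fn h_intra;
};

/* Both pointers are assigned inside each branch, so neither format is
 * left with an unset or stale v_intra pointer. */
static void demo_init_chroma_intra(struct demo_chroma_ctx *c, int chroma_format_idc)
{
    assert(c != NULL);
    if (chroma_format_idc <= 1) {        /* 4:2:0 (and monochrome) */
        c->v_intra = demo_v_intra_daedalus;
        c->h_intra = demo_h_intra_daedalus;
    } else {                             /* 4:2:2 keeps the NEON path */
        c->v_intra = demo_v_intra_neon;
        c->h_intra = demo_h_intra_422_neon;
    }
}
```

Assigning the vertical filter in both branches is what removes the old unconditional assignment without leaving the 4:2:2 case uninitialised.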
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 14:21:08.267156263 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 14:21:08.287745931 +0200
@@ -1,5 +1,5 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h (inter+intra) deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
@@ -9,6 +9,8 @@
* H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
* H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
* H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
+ * H264DSPContext.v_loop_filter_chroma_intra → daedalus_recipe_dispatch_h264_deblock_chroma_v_intra
+ * H264DSPContext.h_loop_filter_chroma_intra → daedalus_recipe_dispatch_h264_deblock_chroma_h_intra
* H264DSPContext.chroma_dc_dequant_idct → daedalus_h264_chroma_dc_hadamard_2x2 + caller-side qmul
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
@@ -61,6 +63,10 @@
int alpha, int beta);
void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
+void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
+void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
@@ -218,3 +224,30 @@
block[stride*1 + xStride*0] = (int16_t)((int)dc[2] * qmul >> 7);
block[stride*1 + xStride*1] = (int16_t)((int)dc[3] * qmul >> 7);
}
+
+void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ /* tc0[] unused for intra (bS=4 hardcodes the strength). */
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_deblock_chroma_v_intra(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
+
+void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_deblock_chroma_h_intra(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 14:21:08.268311057 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 14:21:08.287886563 +0200
@@ -42,6 +42,10 @@
void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
+void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
+void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -133,14 +137,15 @@
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_daedalus;
c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
- c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
if (chroma_format_idc <= 1) {
c->chroma_dc_dequant_idct = ff_h264_chroma_dc_dequant_idct_daedalus;
+ c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_daedalus;
c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
- c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
+ c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_daedalus;
c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
} else {
+ c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma422_neon;
c->h_loop_filter_chroma_mbaff = ff_h264_h_loop_filter_chroma_neon;
c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma422_intra_neon;
--
2.47.3
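Both new wrappers in this patch call `pthread_once` before dispatching, so the shared context is created on first use rather than at library load. The sketch below shows that lazy-init pattern on its own; `demo_ctx`, `demo_ctx_init_once` and `demo_get_ctx` are hypothetical stand-ins, since the real `g_dctx` type and its constructor are not part of this hunk.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the daedalus context. */
struct demo_ctx {
    int init_calls;   /* counts how often the initialiser actually ran */
};

static struct demo_ctx  g_demo_storage;
static struct demo_ctx *g_demo_ctx;
static pthread_once_t   g_demo_once = PTHREAD_ONCE_INIT;

/* Runs at most once per process, no matter how many threads race here. */
static void demo_ctx_init_once(void)
{
    g_demo_storage.init_calls++;
    g_demo_ctx = &g_demo_storage;
}

/* Every wrapper would call this before using the context. */
static struct demo_ctx *demo_get_ctx(void)
{
    int rc = pthread_once(&g_demo_once, demo_ctx_init_once);
    assert(rc == 0);
    (void)rc;
    return g_demo_ctx;
}
```

After the first call, `pthread_once` only has to check that initialisation already completed, so no separate lock is needed in the wrappers.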
@@ -24,8 +24,13 @@ _srcname=FFmpeg
_version='8.1'
_commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935' # v4l2-request-n8.1 tip 2026-04-24
pkgver=8.1.r123329.b57fbbe
-pkgrel=5
+pkgrel=10 # pkgrel=10 — H.264 luma qpel mc20 daedalus-fourier substitution (cycle 9, 2026-05-23)
epoch=2
+# daedalus-fourier pin. 209a421 = PR #2 merge (Phase 8c — public API
+# gains daedalus_recipe_dispatch_h264_qpel_mc20 + DAEDALUS_KERNEL_H264_QPEL_MC20).
+# Cycle 9 closes the libavcodec.so substitution arc started at cycle 6.
+_daedalus_fourier_commit='b9f9ff2a89c068aea54dcb52b543afddad28311e' # PR #25 — public chroma DC Hadamard symbol
pkgdesc='FFmpeg with V4L2 Request API hwaccel (Rockchip / Allwinner stateless decode)'
arch=('aarch64')
url='https://github.com/Kwiboo/FFmpeg'
@@ -34,6 +39,7 @@ depends=(
alsa-lib
bzip2
fontconfig
+vulkan-icd-loader
fribidi
gmp
gnutls
@@ -59,10 +65,13 @@ depends=(
zlib
)
makedepends=(
+cmake
git
linux-api-headers
mesa
nasm
+ninja
+vulkan-headers
)
provides=(
libavcodec.so
@@ -78,9 +87,21 @@ provides=(
conflicts=(ffmpeg)
replaces=(ffmpeg ffmpeg-v4l2-request-git)
source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}"
+"daedalus-fourier-${_daedalus_fourier_commit}.tar.gz::https://git.reauktion.de/marfrit/daedalus-fourier/archive/${_daedalus_fourier_commit}.tar.gz"
'0001-libudev-bypass-fallback.patch'
-'0002-nv15-to-p010-unpack.patch')
-sha256sums=('SKIP' 'SKIP' 'SKIP')
+'0002-nv15-to-p010-unpack.patch'
+'0003-h264-idct4-daedalus-fourier.patch'
+'0004-h264-idct8-daedalus-fourier.patch'
+'0005-h264-deblock-luma-v-daedalus-fourier.patch'
+'0006-h264-restore-low-delay.patch'
+'0007-h264-qpel-mc20-daedalus-fourier.patch'
+'0008-h264-deblock-luma-h-daedalus-fourier.patch'
+'0009-h264-deblock-chroma-daedalus-fourier.patch'
+'0010-h264-deblock-luma-intra-daedalus-fourier.patch'
+'0011-h264-chroma-dc-hadamard-daedalus-fourier.patch'
+'0012-h264-qpel-rest-daedalus-fourier.patch'
+'0013-h264-deblock-chroma-intra-daedalus-fourier.patch')
+sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
pkgver() {
cd "${_srcname}"
@@ -93,9 +114,35 @@ prepare() {
cd "${_srcname}"
patch -Np1 -i "${srcdir}/0001-libudev-bypass-fallback.patch"
patch -Np1 -i "${srcdir}/0002-nv15-to-p010-unpack.patch"
+patch -Np1 -i "${srcdir}/0003-h264-idct4-daedalus-fourier.patch"
+patch -Np1 -i "${srcdir}/0004-h264-idct8-daedalus-fourier.patch"
+patch -Np1 -i "${srcdir}/0005-h264-deblock-luma-v-daedalus-fourier.patch"
+patch -Np1 -i "${srcdir}/0006-h264-restore-low-delay.patch"
+patch -Np1 -i "${srcdir}/0007-h264-qpel-mc20-daedalus-fourier.patch"
+patch -Np1 -i "${srcdir}/0008-h264-deblock-luma-h-daedalus-fourier.patch"
+patch -Np1 -i "${srcdir}/0009-h264-deblock-chroma-daedalus-fourier.patch"
+patch -Np1 -i "${srcdir}/0010-h264-deblock-luma-intra-daedalus-fourier.patch"
+patch -Np1 -i "${srcdir}/0011-h264-chroma-dc-hadamard-daedalus-fourier.patch"
+patch -Np1 -i "${srcdir}/0012-h264-qpel-rest-daedalus-fourier.patch"
+patch -Np1 -i "${srcdir}/0013-h264-deblock-chroma-intra-daedalus-fourier.patch"
}
build() {
+# --- daedalus-fourier: build static .a with PIC, install to a
+# per-build prefix; libavcodec.so links it into the shared object so
+# H264DSPContext.idct_add (and follow-up kernels) dispatch through
+# the daedalus recipe layer instead of the in-tree NEON .S code. ---
+local _fourier_prefix="${srcdir}/fourier-prefix"
+mkdir -p "${_fourier_prefix}"
+pushd "${srcdir}"/daedalus-fourier >/dev/null
+cmake -B build -G Ninja \
+-DCMAKE_BUILD_TYPE=Release \
+-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
+-DCMAKE_INSTALL_PREFIX="${_fourier_prefix}"
+cmake --build build --target daedalus_core
+cmake --install build
+popd >/dev/null
cd "${_srcname}"
# FFmpeg's configure resolves the compiler via `which` and bakes the
@@ -147,6 +194,9 @@ build() {
--enable-libx265 \
--enable-libwebp \
\
+--extra-cflags="-I${_fourier_prefix}/include" \
+--extra-ldflags="-L${_fourier_prefix}/lib" \
+--extra-libs="-ldaedalus_core -lvulkan -lpthread" \
--host-cflags='-fPIC'
make
@@ -0,0 +1,57 @@
From: claude-noether (on behalf of mfritsche)
Date: 2026-05-19
Subject: panvk: expose VK_KHR/EXT_robustness2 + nullDescriptor on Bifrost (PAN_ARCH 6/7)
Without this, Mesa's Zink driver refuses to use PanVk-Bifrost as its Vulkan
backend, falling back silently to llvmpipe (software rasterizer) for all
GL-via-Zink on Bifrost SBCs. That defeats the entire purpose of having a
Vulkan driver on Bifrost — GL acceleration via Zink is the most natural
near-term consumer.
panvk_vX_nir_lower_descriptors.c:1309 and panvk_vX_shader.c:1355 already
plumb dev->vk.enabled_features.nullDescriptor arch-agnostically — the gate
at panvk_vX_physical_device.c was set conservatively when Bifrost was
unmaintained, not because of hardware incapability.
iter17 of the panvk-bifrost campaign proved fundamental driver functions
on Mali-G52 r1 MC1 (PAN_ARCH=7). This patch is the iter8 follow-up.
robustBufferAccess2 and robustImageAccess2 are NOT flipped — they're
independent rb2 features Zink doesn't require, gated differently
(robustBufferAccess2 = PAN_ARCH >= 11, robustImageAccess2 = false), and
out of scope for iter8.
---
src/panfrost/vulkan/panvk_vX_physical_device.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
--- a/src/panfrost/vulkan/panvk_vX_physical_device.c
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -91,7 +91,7 @@ get_device_extensions(const struct panvk_physical_device *device,
.KHR_pipeline_binary = true,
.KHR_pipeline_executable_properties = true,
.KHR_pipeline_library = true,
- .KHR_robustness2 = PAN_ARCH >= 10,
+ .KHR_robustness2 = true,
.KHR_sampler_mirror_clamp_to_edge = true,
.KHR_sampler_ycbcr_conversion = true,
.KHR_separate_depth_stencil_layouts = true,
@@ -168,7 +168,7 @@ get_device_extensions(const struct panvk_physical_device *device,
.EXT_queue_family_foreign = true,
.EXT_robustness = pan_arch(device->kmod.dev->props.gpu_id) >= 9,
.EXT_image_robustness = true,
- .EXT_robustness2 = PAN_ARCH >= 10,
+ .EXT_robustness2 = true,
.EXT_sampler_filter_minmax = PAN_ARCH >= 10,
.EXT_scalar_block_layout = true,
.EXT_separate_stencil_usage = true,
@@ -493,7 +493,7 @@ get_device_features(const struct panvk_physical_device *device,
/* VK_KHR_robustness2 */
.robustBufferAccess2 = PAN_ARCH >= 11,
.robustImageAccess2 = false,
- .nullDescriptor = PAN_ARCH >= 10,
+ .nullDescriptor = true,
/* VK_KHR_shader_clock */
.shaderSubgroupClock = device->kmod.dev->props.gpu_can_query_timestamp,
@@ -0,0 +1,47 @@
From: claude-noether (on behalf of mfritsche)
Date: 2026-05-20
Subject: panvk: expose Vulkan 1.1 + 1.2 on Bifrost (PAN_ARCH 6/7)
ANGLE (Chromium's GL stack) requires apiVersion >= 1.1 to initialize. Without
this, Brave / Chromium's GPU process fails at GL info collection:
vk_renderer.cpp:2659 (initialize): ANGLE Requires a minimum Vulkan device
version of 1.1
Display::initialize error 0: Internal Vulkan error (-9): The requested
version of Vulkan is not supported by the driver
Stack-up with iter8's robustness2 patch enables ANGLE → PanVk-Bifrost →
Skia (via --enable-features=Vulkan) on Bifrost SBCs.
PanVk-Bifrost already supports the bulk of 1.1-promoted features as extensions
(multiview, maintenance1-3, descriptor update template, 16-bit storage,
sampler ycbcr, variable pointers, etc.; all are
visible in iter0 vulkaninfo). The version bump primarily bundles them.
Risk: Vulkan 1.1 has features beyond what iter17 exercised (protected memory,
full subgroup ops). App-specific failures can be characterized as they surface.
1.2 is also flipped — Brave's Vulkan path may want descriptor indexing,
buffer device address, etc. (all listed in iter0 vulkaninfo as supported
extensions, just gated as 1.0-with-extensions, not 1.2-core).
---
src/panfrost/vulkan/panvk_vX_physical_device.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
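The ANGLE failure quoted above comes down to a comparison on the packed `apiVersion` value. A minimal sketch of that packing follows, using the same bit layout as `VK_MAKE_API_VERSION` with variant 0; the helper names are hypothetical, and `demo_meets_minimum_1_1` only models the kind of check described, since ANGLE's actual code is not shown in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of a Vulkan API version with variant 0:
 * major in bits 22..28, minor in bits 12..21, patch in bits 0..11. */
static uint32_t demo_make_api_version(uint32_t major, uint32_t minor, uint32_t patch)
{
    assert(major < 128 && minor < 1024 && patch < 4096);
    return (major << 22) | (minor << 12) | patch;
}

static uint32_t demo_api_major(uint32_t v) { return (v >> 22) & 0x7Fu; }
static uint32_t demo_api_minor(uint32_t v) { return (v >> 12) & 0x3FFu; }

/* Hypothetical consumer-side gate: because major and minor occupy the
 * high bits, a plain integer comparison orders versions correctly. */
static int demo_meets_minimum_1_1(uint32_t device_api_version)
{
    return device_api_version >= demo_make_api_version(1, 1, 0);
}
```

A device reporting 1.0 with any patch level fails this gate regardless of how many 1.1-promoted extensions it exposes, which matches the behaviour the commit message describes.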
diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
--- a/src/panfrost/vulkan/panvk_vX_physical_device.c
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -38,8 +38,8 @@ get_device_extensions(const struct panvk_physical_device *device,
struct vk_device_extension_table *ext)
{
*ext = (struct vk_device_extension_table){
- .KHR_8bit_storage = true,
- .KHR_16bit_storage = true,
- bool has_vk1_1 = PAN_ARCH >= 10;
- bool has_vk1_2 = PAN_ARCH >= 10;
+ .KHR_8bit_storage = true,
+ .KHR_16bit_storage = true,
+ bool has_vk1_1 = true;
+ bool has_vk1_2 = true;
*ext = (struct vk_device_extension_table){
@@ -0,0 +1,328 @@
--- a/src/panfrost/vulkan/panvk_shader.h 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_shader.h 2026-05-20 18:52:53.312698258 +0200
@@ -150,6 +150,10 @@
struct {
#if PAN_ARCH < 9
int32_t raw_vertex_offset;
+ uint32_t num_vertices; /* iter13: XFB needs per-draw vertex count */
+ /* aligned_u64 attribute below inserts the 4-byte alignment gap
+ * after num_vertices automatically — no explicit pad needed. */
+ aligned_u64 xfb_address[4]; /* iter13: 4 transform feedback buffer base addresses */
#endif
int32_t first_vertex;
int32_t base_instance;
--- a/src/panfrost/vulkan/panvk_vX_physical_device.c 2026-05-20 19:09:29.711145446 +0200
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c 2026-05-20 18:52:54.832720445 +0200
@@ -169,6 +169,7 @@
.EXT_provoking_vertex = true,
.EXT_queue_family_foreign = true,
.EXT_robustness2 = true,
+ .EXT_transform_feedback = PAN_ARCH < 9, /* iter13: JM-class only for now */
.EXT_sampler_filter_minmax = PAN_ARCH >= 10,
.EXT_scalar_block_layout = true,
.EXT_separate_stencil_usage = true,
@@ -495,6 +496,10 @@
.robustImageAccess2 = false,
.nullDescriptor = true,
+ /* VK_EXT_transform_feedback (iter13) */
+ .transformFeedback = PAN_ARCH < 9,
+ .geometryStreams = false,
+
/* VK_KHR_shader_clock */
.shaderSubgroupClock = device->kmod.dev->props.gpu_can_query_timestamp,
.shaderDeviceClock = device->kmod.dev->props.timestamp_device_coherent,
@@ -1020,6 +1025,18 @@
.robustStorageBufferAccessSizeAlignment = 1,
.robustUniformBufferAccessSizeAlignment = 1,
+ /* VK_EXT_transform_feedback (iter13) */
+ .maxTransformFeedbackStreams = 1,
+ .maxTransformFeedbackBuffers = 4,
+ .maxTransformFeedbackBufferSize = UINT32_MAX,
+ .maxTransformFeedbackStreamDataSize = 512,
+ .maxTransformFeedbackBufferDataSize = 512,
+ .maxTransformFeedbackBufferDataStride = 2048,
+ .transformFeedbackQueries = false,
+ .transformFeedbackStreamsLinesTriangles = false,
+ .transformFeedbackRasterizationStreamSelect = false,
+ .transformFeedbackDraw = false,
+
/* VK_EXT_shader_object */
/* We do not currently support VK_EXT_shader_object but this is used
* internally by vk_shader
--- a/src/panfrost/vulkan/panvk_vX_shader.c 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-20 18:52:56.556745611 +0200
@@ -21,6 +21,7 @@
#include "panvk_physical_device.h"
#include "panvk_sampler.h"
#include "panvk_shader.h"
+#include "pan_nir.h" /* iter13: pan_nir_lower_xfb */
#include "spirv/nir_spirv.h"
#include "util/memstream.h"
@@ -100,6 +101,20 @@
case nir_intrinsic_load_raw_vertex_offset_pan:
val = load_sysval(b, graphics, bit_size, vs.raw_vertex_offset);
break;
+ case nir_intrinsic_load_num_vertices: /* iter13: XFB index calc */
+ val = load_sysval(b, graphics, bit_size, vs.num_vertices);
+ break;
+ case nir_intrinsic_load_xfb_address: { /* iter13: XFB buffer N base address */
+ unsigned idx = nir_intrinsic_base(intr);
+ switch (idx) {
+ case 0: val = load_sysval(b, graphics, bit_size, vs.xfb_address[0]); break;
+ case 1: val = load_sysval(b, graphics, bit_size, vs.xfb_address[1]); break;
+ case 2: val = load_sysval(b, graphics, bit_size, vs.xfb_address[2]); break;
+ case 3: val = load_sysval(b, graphics, bit_size, vs.xfb_address[3]); break;
+ default: return false;
+ }
+ break;
+ }
case nir_intrinsic_load_layer_id:
assert(b->shader->info.stage == MESA_SHADER_FRAGMENT);
val = load_sysval(b, graphics, bit_size, layer_id);
@@ -457,6 +472,7 @@
core_max_id);
pan_preprocess_nir(nir, pdev->kmod.dev->props.gpu_id);
+
}
static void
@@ -870,6 +886,18 @@
nir_var_shader_in | nir_var_shader_out, UINT32_MAX);
NIR_PASS(_, nir, nir_lower_io, nir_var_shader_in | nir_var_shader_out,
glsl_type_size, nir_lower_io_use_interpolated_input_intrinsics);
+
+#if PAN_ARCH < 9
+ /* iter13: VK_EXT_transform_feedback — runs AFTER nir_lower_io so that
+ * shader outputs are now store_output intrinsics that pan_nir_lower_xfb
+ * can rewrite to nir_store_global+nir_load_xfb_address. */
+ if (nir->info.stage == MESA_SHADER_VERTEX &&
+ nir->info.has_transform_feedback_varyings) {
+ NIR_PASS(_, nir, nir_opt_constant_folding);
+ NIR_PASS(_, nir, nir_io_add_intrinsic_xfb_info);
+ NIR_PASS(_, nir, pan_nir_lower_xfb);
+ }
+#endif
}
static VkResult
@@ -1288,6 +1316,9 @@
.view_mask = (state && state->rp) ? state->rp->view_mask : 0,
.robust2_modes = robust2_modes,
.robust_descriptors = dev->vk.enabled_features.nullDescriptor,
+ /* iter13: XFB shaders must disable IDVS (matches Panfrost-Gallium). */
+ .no_idvs = (info->stage == MESA_SHADER_VERTEX) &&
+ info->nir->info.has_transform_feedback_varyings,
};
switch (info->stage) {
--- a/src/panfrost/vulkan/panvk_cmd_draw.h 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_cmd_draw.h 2026-05-20 18:52:57.748763011 +0200
@@ -135,6 +135,19 @@
struct panvk_graphics_sysvals sysvals;
#if PAN_ARCH < 9
+ /* iter13: VK_EXT_transform_feedback state (JM-class only for now). */
+ struct {
+ bool active;
+ uint32_t buffer_count;
+ struct {
+ uint64_t addr;
+ uint64_t offset;
+ uint64_t size;
+ } buffers[4];
+ } xfb;
+#endif
+
+#if PAN_ARCH < 9
struct panvk_shader_link link;
#endif
--- a/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-20 19:10:23.031919662 +0200
@@ -10,6 +10,7 @@
#include "panvk_entrypoints.h"
#include "pan_desc.h"
+#include "pan_compiler.h" /* PAN_SHADER_OOB_ADDRESS */
#include "pan_util.h"
static void
@@ -722,6 +723,35 @@
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.raw_vertex_offset,
info->vertex.raw_offset);
set_gfx_sysval(cmdbuf, dirty_sysvals, layer_id, info->layer_id);
+
+ /* iter13: VK_EXT_transform_feedback sysvals — always set (per draw),
+ * reflect bound XFB state. set_gfx_sysval is a no-op if value unchanged. */
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, info->vertex.count);
+ {
+ const struct panvk_cmd_graphics_state *_gfx = &cmdbuf->state.gfx;
+ /* iter13: default each XFB buffer address to PAN_SHADER_OOB_ADDRESS
+ * (= 1<<63). This is the Panfrost-Gallium memory-sink idiom — the
+ * Bifrost MMU silently discards stores to this address, so a pipeline
+ * with XFB outputs used in a non-XFB draw (or in an XFB draw with
+ * fewer bound buffers than the shader declares) is safe instead of
+ * faulting. See gallium/drivers/panfrost/pan_cmdstream.c PAN_SYSVAL_XFB. */
+ uint64_t _xa0 = PAN_SHADER_OOB_ADDRESS, _xa1 = PAN_SHADER_OOB_ADDRESS,
+ _xa2 = PAN_SHADER_OOB_ADDRESS, _xa3 = PAN_SHADER_OOB_ADDRESS;
+ if (_gfx->xfb.active) {
+ if (_gfx->xfb.buffer_count > 0 && _gfx->xfb.buffers[0].addr)
+ _xa0 = _gfx->xfb.buffers[0].addr + _gfx->xfb.buffers[0].offset;
+ if (_gfx->xfb.buffer_count > 1 && _gfx->xfb.buffers[1].addr)
+ _xa1 = _gfx->xfb.buffers[1].addr + _gfx->xfb.buffers[1].offset;
+ if (_gfx->xfb.buffer_count > 2 && _gfx->xfb.buffers[2].addr)
+ _xa2 = _gfx->xfb.buffers[2].addr + _gfx->xfb.buffers[2].offset;
+ if (_gfx->xfb.buffer_count > 3 && _gfx->xfb.buffers[3].addr)
+ _xa3 = _gfx->xfb.buffers[3].addr + _gfx->xfb.buffers[3].offset;
+ }
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[0], _xa0);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[1], _xa1);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[2], _xa2);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_address[3], _xa3);
+ }
#endif
if (dyn_gfx_state_dirty(cmdbuf, CB_BLEND_CONSTANTS)) {
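The per-draw address selection in the hunk above is written out four times, once per slot. The same rule can be stated as one function: a slot gets its bound address plus offset only while XFB is active, the slot is below `buffer_count`, and the address is non-zero; otherwise it gets the out-of-bounds sink. This is a sketch with hypothetical `demo_*` names; the sink value is taken from the patch comment (`1 << 63`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_XFB_BUFFERS 4
/* Value quoted in the patch comment for PAN_SHADER_OOB_ADDRESS. */
#define DEMO_OOB_ADDRESS (UINT64_C(1) << 63)

struct demo_xfb_state {
    bool     active;
    uint32_t buffer_count;
    struct {
        uint64_t addr;
        uint64_t offset;
        uint64_t size;
    } buffers[DEMO_MAX_XFB_BUFFERS];
};

/* Returns the address the shader should see for one XFB slot. */
static uint64_t demo_xfb_address(const struct demo_xfb_state *xfb, uint32_t slot)
{
    assert(slot < DEMO_MAX_XFB_BUFFERS);
    if (xfb->active && slot < xfb->buffer_count && xfb->buffers[slot].addr != 0)
        return xfb->buffers[slot].addr + xfb->buffers[slot].offset;
    return DEMO_OOB_ADDRESS;   /* stores to the sink are discarded */
}
```

Defaulting to the sink rather than zero is what makes a pipeline with XFB outputs safe to use in a draw where no capture is active.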
--- a/src/panfrost/vulkan/meson.build 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/meson.build 2026-05-20 18:53:04.484861338 +0200
@@ -73,6 +73,7 @@
jm_inc_dir = ['jm']
jm_files = [
'jm/panvk_vX_bind_queue.c',
+ 'jm/panvk_vX_cmd_xfb.c', # iter13
'jm/panvk_vX_cmd_buffer.c',
'jm/panvk_vX_cmd_dispatch.c',
'jm/panvk_vX_cmd_draw.c',
--- a/src/panfrost/vulkan/jm/panvk_vX_cmd_buffer.c 2026-04-29 22:19:00.000000000 +0200
+++ b/src/panfrost/vulkan/jm/panvk_vX_cmd_buffer.c 2026-05-20 19:10:26.163965149 +0200
@@ -473,5 +473,12 @@
vk_command_buffer_begin(&cmdbuf->vk, pBeginInfo);
+#if PAN_ARCH < 9
+ /* iter13: clear XFB state on Begin so a reused command buffer does not
+ * inherit stale xfb.buffer_count / xfb.active / xfb.buffers[] from a
+ * prior recording. */
+ memset(&cmdbuf->state.gfx.xfb, 0, sizeof(cmdbuf->state.gfx.xfb));
+#endif
+
return VK_SUCCESS;
}
--- a/src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c 2026-05-18 12:50:53.067999996 +0200
+++ b/src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c 2026-05-20 19:10:27.175979847 +0200
@@ -0,0 +1,111 @@
+/*
+ * Copyright © 2026 mfritsche / claude-noether
+ * SPDX-License-Identifier: MIT
+ *
+ * iter13: VK_EXT_transform_feedback command handlers for the JM
+ * architecture path (Bifrost v6/v7; all iter13 gates are PAN_ARCH < 9).
+ *
+ * The runtime contract:
+ * - vkCmdBindTransformFeedbackBuffersEXT: stash (gpu_addr, offset, size)
+ * for each slot into cmdbuf->state.gfx.xfb.buffers[].
+ * - vkCmdBeginTransformFeedbackEXT: set cmdbuf->state.gfx.xfb.active = true.
+ * The next draw's set_gfx_sysval calls then re-emit vs.xfb_address[].
+ * - vkCmdEndTransformFeedbackEXT: set active = false.
+ *
+ * Counter buffers (firstCounterBuffer/counterBufferCount/pCounterBuffers/
+ * pCounterBufferOffsets) are accepted by API but ignored — v1 doesn't
+ * support pause/resume. transformFeedbackDraw is advertised as false.
+ *
+ * Per-draw integration: jm/panvk_vX_cmd_draw.c reads cmdbuf->state.gfx.xfb
+ * and populates vs.xfb_address[i] for shader use. The pan_nir_lower_xfb
+ * pass in panvk_vX_shader.c emits nir_load_xfb_address(i) which lowers
+ * (via panvk_vX_shader.c sysval handler) to a load from the per-draw
+ * sysval push area.
+ */
+
+#include "vk_log.h"
+#include "util/log.h"
+
+#include "panvk_cmd_buffer.h"
+#include "panvk_cmd_draw.h"
+#include "panvk_buffer.h"
+#include "panvk_entrypoints.h"
+
+VKAPI_ATTR void VKAPI_CALL
+panvk_per_arch(CmdBindTransformFeedbackBuffersEXT)(
+ VkCommandBuffer commandBuffer,
+ uint32_t firstBinding,
+ uint32_t bindingCount,
+ const VkBuffer *pBuffers,
+ const VkDeviceSize *pOffsets,
+ const VkDeviceSize *pSizes)
+{
+ VK_FROM_HANDLE(panvk_cmd_buffer, cmdbuf, commandBuffer);
+ struct panvk_cmd_graphics_state *gfx = &cmdbuf->state.gfx;
+
+ for (uint32_t i = 0; i < bindingCount; i++) {
+ uint32_t slot = firstBinding + i;
+ if (slot >= 4)
+ continue;
+
+ VK_FROM_HANDLE(panvk_buffer, buf, pBuffers[i]);
+ gfx->xfb.buffers[slot].addr = panvk_buffer_gpu_ptr(buf, 0);
+ gfx->xfb.buffers[slot].offset = pOffsets[i];
+ gfx->xfb.buffers[slot].size =
+ (pSizes != NULL && pSizes[i] != VK_WHOLE_SIZE)
+ ? pSizes[i]
+ : (buf->vk.size - pOffsets[i]);
+ }
+
+ if (firstBinding + bindingCount > gfx->xfb.buffer_count)
+ gfx->xfb.buffer_count = firstBinding + bindingCount;
+}
+
+VKAPI_ATTR void VKAPI_CALL
+panvk_per_arch(CmdBeginTransformFeedbackEXT)(
+ VkCommandBuffer commandBuffer,
+ uint32_t firstCounterBuffer,
+ uint32_t counterBufferCount,
+ const VkBuffer *pCounterBuffers,
+ const VkDeviceSize *pCounterBufferOffsets)
+{
+ VK_FROM_HANDLE(panvk_cmd_buffer, cmdbuf, commandBuffer);
+ struct panvk_cmd_graphics_state *gfx = &cmdbuf->state.gfx;
+
+ /* Counter buffers ignored in v1 — see VkPhysicalDeviceTransformFeedback
+ * PropertiesEXT.transformFeedbackDraw = false in panvk_vX_physical_device.c.
+ * App is spec-compliant if it does not pass counter buffers (which our
+ * features advertisement allows), but warn loudly if it does so we do not
+ * silently produce wrong capture state. */
+ (void)firstCounterBuffer;
+ (void)pCounterBufferOffsets;
+ if (counterBufferCount > 0 && pCounterBuffers != NULL) {
+ mesa_logw("panvk: CmdBeginTransformFeedbackEXT: counter buffers not "
+ "implemented (transformFeedbackDraw=false); XFB resume will "
+ "restart at buffer offset 0");
+ }
+
+ gfx->xfb.active = true;
+ /* Per-draw set_gfx_sysval picks up the change automatically — no
+ * explicit dirty marking required (set_gfx_sysval uses memcmp +
+ * BITSET to detect state diffs and re-emit sysvals). */
+}
+
+VKAPI_ATTR void VKAPI_CALL
+panvk_per_arch(CmdEndTransformFeedbackEXT)(
+ VkCommandBuffer commandBuffer,
+ uint32_t firstCounterBuffer,
+ uint32_t counterBufferCount,
+ const VkBuffer *pCounterBuffers,
+ const VkDeviceSize *pCounterBufferOffsets)
+{
+ VK_FROM_HANDLE(panvk_cmd_buffer, cmdbuf, commandBuffer);
+ struct panvk_cmd_graphics_state *gfx = &cmdbuf->state.gfx;
+
+ (void)firstCounterBuffer;
+ (void)counterBufferCount;
+ (void)pCounterBuffers;
+ (void)pCounterBufferOffsets;
+
+ gfx->xfb.active = false;
+}
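Two small rules in the bind handler above are easy to misread in diff form: how the captured size is resolved, and how `buffer_count` grows. A minimal sketch of both follows, assuming hypothetical `demo_*` helpers; `DEMO_WHOLE_SIZE` carries the same value as `VK_WHOLE_SIZE` and is redefined only so the sketch needs no Vulkan headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as VK_WHOLE_SIZE (all bits set). */
#define DEMO_WHOLE_SIZE (~UINT64_C(0))

/* An explicit size wins; a NULL size array or the whole-size sentinel
 * means "from offset to the end of the buffer". */
static uint64_t demo_resolve_xfb_size(uint64_t buffer_size, uint64_t offset,
                                      const uint64_t *requested_size)
{
    assert(offset <= buffer_size);
    if (requested_size != NULL && *requested_size != DEMO_WHOLE_SIZE)
        return *requested_size;
    return buffer_size - offset;
}

/* buffer_count acts as a high-water mark: a bind can raise it but
 * never lowers it. */
static uint32_t demo_update_buffer_count(uint32_t current, uint32_t first_binding,
                                         uint32_t binding_count)
{
    uint32_t end = first_binding + binding_count;
    return end > current ? end : current;
}
```

Because the count only grows within a recording, the `memset` of `xfb` state at command buffer begin (in the `panvk_vX_cmd_buffer.c` hunk) is what brings it back to zero.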
@@ -0,0 +1,629 @@
diff -urN a/src/panfrost/vulkan/meson.build b/src/panfrost/vulkan/meson.build
--- a/src/panfrost/vulkan/meson.build 2026-05-21 14:04:02.529474145 +0200
+++ b/src/panfrost/vulkan/meson.build 2026-05-21 14:04:04.106755486 +0200
@@ -123,6 +123,7 @@
'panvk_vX_nir_lower_input_attachment_loads.c',
'panvk_vX_sampler.c',
'panvk_vX_shader.c',
+ 'panvk_vX_xfb_lower.c',
sha1_h,
]
diff -urN a/src/panfrost/vulkan/panvk_shader.h b/src/panfrost/vulkan/panvk_shader.h
--- a/src/panfrost/vulkan/panvk_shader.h 2026-05-21 14:04:02.525251986 +0200
+++ b/src/panfrost/vulkan/panvk_shader.h 2026-05-21 14:04:04.084251800 +0200
@@ -154,6 +154,8 @@
/* aligned_u64 attribute below inserts the 4-byte alignment gap
* after num_vertices automatically — no explicit pad needed. */
aligned_u64 xfb_address[4]; /* iter13: 4 transform feedback buffer base addresses */
+ uint32_t xfb_topology; /* iter17: panvk_xfb_topology enum value */
+ uint32_t xfb_output_count; /* iter17: per-instance output verts after decomp */
#endif
int32_t first_vertex;
int32_t base_instance;
@@ -569,4 +571,76 @@
struct pan_compute_dim local_size, const void *bin_ptr, size_t bin_size,
struct panvk_shader **shader_out);
+
+#if PAN_ARCH < 9
+/* iter17: encoding for vs.xfb_topology sysval. Maps VkPrimitiveTopology values
+ * we need to distinguish at shader runtime for XFB capture. LIST topologies
+ * use the iter13 single-store fast path; non-LIST need per-vertex decomposition. */
+enum panvk_xfb_topology {
+ PANVK_XFB_TOPO_LIST = 0,
+ PANVK_XFB_TOPO_LINE_STRIP = 1,
+ PANVK_XFB_TOPO_TRI_STRIP = 2,
+ PANVK_XFB_TOPO_TRI_FAN = 3,
+ PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
+ PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
+ PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
+ PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
+};
+
+#include "panvk_macros.h"
+struct nir_shader;
+bool panvk_per_arch(nir_lower_xfb)(struct nir_shader *nir);
+
+/* Map VkPrimitiveTopology to panvk_xfb_topology enum (driver-side helper). */
+static inline uint32_t
+panvk_vk_topology_to_xfb_enum(VkPrimitiveTopology topo)
+{
+ switch (topo) {
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+ return PANVK_XFB_TOPO_LINE_STRIP;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+ return PANVK_XFB_TOPO_TRI_STRIP;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+ return PANVK_XFB_TOPO_TRI_FAN;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_LINE_LIST_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_TRI_LIST_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_POINT_LIST:
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST:
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST:
+ default:
+ return PANVK_XFB_TOPO_LIST;
+ }
+}
+
+/* Compute the per-instance output vertex count for a given (topology, input count). */
+static inline uint32_t
+panvk_xfb_output_count(VkPrimitiveTopology topo, uint32_t input_count)
+{
+ switch (topo) {
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+ return input_count >= 1 ? 2u * (input_count - 1u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+ return input_count >= 2 ? 3u * (input_count - 2u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+ return (input_count / 4u) * 2u;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+ return input_count >= 3 ? 2u * (input_count - 3u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+ return (input_count / 6u) * 3u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+ return input_count >= 6 ? 3u * (input_count / 2u - 2u) : 0u;
+ default:
+ return input_count; /* LIST topologies: 1:1 mapping */
+ }
+}
+#endif
+
+
#endif
diff -urN a/src/panfrost/vulkan/panvk_vX_cmd_draw.c b/src/panfrost/vulkan/panvk_vX_cmd_draw.c
--- a/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-21 14:04:02.528576354 +0200
+++ b/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-21 14:04:04.091357598 +0200
@@ -727,6 +727,20 @@
/* iter13: VK_EXT_transform_feedback sysvals — always set (per draw),
* reflect bound XFB state. set_gfx_sysval is a no-op if value unchanged. */
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, info->vertex.count);
+
+ /* iter17: XFB primitive-decomposition sysvals.
+ * xfb_topology = enum value for the current bound topology.
+ * xfb_output_count = per-instance output vertex count after decomposition.
+ * For LIST topologies, output_count == input vertex count and the shader
+ * takes the iter13 single-store fast path. */
+ {
+ VkPrimitiveTopology vk_topo =
+ cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;
+ uint32_t topo_enum = panvk_vk_topology_to_xfb_enum(vk_topo);
+ uint32_t out_count = panvk_xfb_output_count(vk_topo, info->vertex.count);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_topology, topo_enum);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_output_count, out_count);
+ }
{
const struct panvk_cmd_graphics_state *_gfx = &cmdbuf->state.gfx;
/* iter13: default each XFB buffer address to PAN_SHADER_OOB_ADDRESS
diff -urN a/src/panfrost/vulkan/panvk_vX_shader.c b/src/panfrost/vulkan/panvk_vX_shader.c
--- a/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-21 14:04:02.527576494 +0200
+++ b/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-21 14:04:04.098356619 +0200
@@ -895,7 +895,10 @@
nir->info.has_transform_feedback_varyings) {
NIR_PASS(_, nir, nir_opt_constant_folding);
NIR_PASS(_, nir, nir_io_add_intrinsic_xfb_info);
- NIR_PASS(_, nir, pan_nir_lower_xfb);
+ /* iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+ * primitive decomposition for non-LIST topologies. Single-store LIST
+ * fast path matches iter13 behavior. */
+ NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb));
}
#endif
}
diff -urN a/src/panfrost/vulkan/panvk_vX_xfb_lower.c b/src/panfrost/vulkan/panvk_vX_xfb_lower.c
--- a/src/panfrost/vulkan/panvk_vX_xfb_lower.c 1970-01-01 01:00:00.000000000 +0100
+++ b/src/panfrost/vulkan/panvk_vX_xfb_lower.c 2026-05-21 14:04:04.115354242 +0200
@@ -0,0 +1,486 @@
+/*
+ * Copyright © 2026 mfritsche / claude-noether
+ * SPDX-License-Identifier: MIT
+ *
+ * iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+ * primitive decomposition for transform_feedback on non-LIST topologies
+ * (TRIANGLE_STRIP/FAN, LINE_STRIP, *_WITH_ADJACENCY).
+ *
+ * Approach: emit a topology dispatch at the start of each store_output
+ * lowering. The shader reads vs.xfb_topology sysval at runtime and branches
+ * into per-topology emission logic. For each affected topology, the lowered
+ * code emits guarded conditional stores — one per primitive this vertex
+ * contributes to, computing the output buffer position via primitive index
+ * and slot within the decomposed primitive.
+ *
+ * For LIST topologies (POINT/LINE/TRIANGLE LIST), takes a fast path that
+ * matches iter13's single-store behavior.
+ *
+ * For TRIANGLE_FAN, the central vertex (v=0) contributes to ALL primitives
+ * as slot 2 — handled via a NIR loop bounded by num_vertices.
+ *
+ * See ~/src/panvk-bifrost/iter17/phase{0,1,2}_*.md for full design context.
+ */
+
+#include "panvk_macros.h"
+
+#if PAN_ARCH < 9
+
+#include "panvk_shader.h"
+
+#include "compiler/nir/nir_builder.h"
+#include "pan_nir.h"
+
+#include <vulkan/vulkan_core.h>
+
+/* ----- Address arithmetic ----- */
+
+static nir_def *
+xfb_store_addr(nir_builder *b, nir_def *buf, nir_def *out_idx,
+ uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *byte_off = nir_iadd_imm(b,
+ nir_imul_imm(b, out_idx, stride), offset_bytes);
+ return nir_iadd(b, buf, nir_u2u64(b, byte_off));
+}
+
+static void
+emit_list_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+ nir_def *instance_id, nir_def *raw_vid, nir_def *value,
+ uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count), raw_vid);
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+}
+
+static void
+emit_prim_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+ nir_def *instance_id, nir_def *eligible,
+ nir_def *prim_idx, nir_def *slot,
+ uint32_t verts_per_prim,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_push_if(b, eligible);
+ {
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count),
+ nir_iadd(b, nir_imul_imm(b, prim_idx, verts_per_prim), slot));
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* ----- Per-topology emission ----- */
+
+/* TRIANGLE_STRIP: vertex v contributes to prims v, v-1, v-2 (per eligibility). */
+static void
+emit_tri_strip(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+ /* Prim v, slot 0: v < N-2 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v, Nm2),
+ v, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+ /* Prim v-1, slot = 1 if prim even else 2: 1 <= v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_iadd_imm(b, parity, 1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot = 2 if prim even else 1: 2 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+}
+
+/* LINE_STRIP: vertex v contributes to prim v slot 0 + prim v-1 slot 1. */
+static void
+emit_line_strip(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+ /* Prim v, slot 0: v < N-1 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v, Nm1),
+ v, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+ /* Prim v-1, slot 1: 1 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_FAN: prim p emits {p+1, p+2, 0}.
+ * vertex v=0: contributes to ALL prims as slot 2 (loop required)
+ * vertex v>=1: contributes to prim v-1 as slot 0 (if 1 <= v <= N-2)
+ * vertex v>=2: contributes to prim v-2 as slot 1 (if 2 <= v <= N-1)
+ */
+static void
+emit_tri_fan(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+ /* Prim v-1, slot 0: 1 <= v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot 1: 2 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 3, value, stride, offset_bytes);
+ }
+
+ /* Central vertex (v == 0): loop over all prims, write to slot 2. */
+ nir_push_if(b, nir_ieq_imm(b, v, 0));
+ {
+ nir_variable *p_var = nir_local_variable_create(b->impl,
+ glsl_uint_type(), "fan_p");
+ nir_store_var(b, p_var, nir_imm_int(b, 0), 0x1);
+ nir_push_loop(b);
+ {
+ nir_def *p = nir_load_var(b, p_var);
+ nir_push_if(b, nir_uge(b, p, Nm2));
+ {
+ nir_jump(b, nir_jump_break);
+ }
+ nir_pop_if(b, NULL);
+
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count),
+ nir_iadd_imm(b, nir_imul_imm(b, p, 3), 2));
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+
+ nir_store_var(b, p_var, nir_iadd_imm(b, p, 1), 0x1);
+ }
+ nir_pop_loop(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* LINE_LIST_WITH_ADJACENCY: 4-vertex groups [4i..4i+3]; output {4i+1, 4i+2}.
+ * v contributes if v%4 == 1: prim v/4 slot 0
+ * v contributes if v%4 == 2: prim v/4 slot 1
+ */
+static void
+emit_line_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *nprims = nir_ushr_imm(b, N, 2); /* N / 4: a trailing partial group is dropped */
+ nir_def *vmod4 = nir_iand_imm(b, v, 3u);
+ nir_def *prim = nir_ushr_imm(b, v, 2); /* v / 4 */
+ nir_def *in_range = nir_ult(b, prim, nprims); /* keeps stores inside output_count */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_iand(b, in_range, nir_ieq_imm(b, vmod4, 1)),
+ prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_iand(b, in_range, nir_ieq_imm(b, vmod4, 2)),
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+}
+
+/* LINE_STRIP_WITH_ADJACENCY: N vertices yield N-3 prims; prim p emits {p+1, p+2}.
+ * v contributes to prim v-1 slot 0 (1 <= v <= N-3)
+ * v contributes to prim v-2 slot 1 (2 <= v <= N-2)
+ */
+static void
+emit_line_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+ /* Prim v-1, slot 0: prim must be < N-3, so 1 <= v <= N-3 ⇔ v >= 1 AND v < N-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm2));
+ /* v = 0 and v = N-1 are adjacency-only vertices and are never stored. */
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot 1: prim must be < N-3, so 2 <= v <= N-2 ⇔ v >= 2 AND v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_LIST_WITH_ADJACENCY: 6-vertex groups; output {6i, 6i+2, 6i+4}.
+ * v contributes if v%6 == 0: prim v/6 slot 0
+ * v contributes if v%6 == 2: prim v/6 slot 1
+ * v contributes if v%6 == 4: prim v/6 slot 2
+ */
+static void
+emit_tri_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *nprims = nir_udiv_imm(b, N, 6); /* a trailing partial group is dropped */
+ nir_def *vmod6 = nir_umod_imm(b, v, 6);
+ nir_def *prim = nir_udiv_imm(b, v, 6);
+ nir_def *in_range = nir_ult(b, prim, nprims); /* keeps stores inside output_count */
+ for (uint32_t slot = 0; slot < 3; slot++) {
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_iand(b, in_range, nir_ieq_imm(b, vmod6, slot * 2)),
+ prim, nir_imm_int(b, slot), 3, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_STRIP_WITH_ADJACENCY: prim i emits:
+ * even i: {2i, 2i+2, 2i+4} (slots 0, 1, 2 ← input indices 2i, 2i+2, 2i+4)
+ * odd i: {2i, 2i+4, 2i+2} (slots 0, 1, 2 ← input indices 2i, 2i+4, 2i+2)
+ *
+ * Only EVEN input vertices contribute (since all output indices are 2*something).
+ * For even input v:
+ * prim v/2 slot 0 (always, if v/2 < N/2-2)
+ * prim (v-2)/2 slot 1 if (v-2)/2 even, slot 2 if odd (when v >= 2)
+ * prim (v-4)/2 slot 2 if (v-4)/2 even, slot 1 if odd (when v >= 4)
+ */
+static void
+emit_tri_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ /* Bail for odd input vertices — they never contribute. */
+ nir_def *v_is_even = nir_ieq_imm(b, nir_iand_imm(b, v, 1u), 0);
+ nir_push_if(b, v_is_even);
+ {
+ nir_def *N_half = nir_ushr_imm(b, N, 1);
+ nir_def *max_prim = nir_iadd_imm(b, N_half, -2); /* N/2 - 2 */
+ nir_def *v_half = nir_ushr_imm(b, v, 1);
+
+ /* Prim v/2 slot 0: v/2 < N/2 - 2 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v_half, max_prim),
+ v_half, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+ /* Prim (v-2)/2 = v/2 - 1: v >= 2 AND prim < N/2-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v_half, -1);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_iadd_imm(b, parity, 1); /* even→1, odd→2 */
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, prim, max_prim));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+
+ /* Prim (v-4)/2 = v/2 - 2: v >= 4 AND prim < N/2-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v_half, -2);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity); /* even→2, odd→1 */
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 4)),
+ nir_ult(b, prim, max_prim));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* ----- Main lowering: per store_output XFB channel ----- */
+
+static void
+lower_xfb_output_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+ unsigned channel_idx, unsigned num_components,
+ unsigned buffer, unsigned offset_words)
+{
+ assert(buffer < MAX_XFB_BUFFERS);
+ assert(nir_intrinsic_component(intr) == 0);
+
+ uint16_t stride = b->shader->info.xfb_stride[buffer] * 4;
+ assert(stride != 0);
+ uint16_t offset_bytes = offset_words * 4;
+
+ BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_VERTEX_ID_ZERO_BASE);
+ BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_INSTANCE_ID);
+
+ nir_def *topology = load_sysval(b, graphics, 32, vs.xfb_topology);
+ nir_def *out_count = load_sysval(b, graphics, 32, vs.xfb_output_count);
+ nir_def *N = nir_load_num_vertices(b);
+ nir_def *v = nir_load_raw_vertex_id_pan(b);
+ nir_def *instance = nir_load_instance_id(b);
+ nir_def *buf = nir_load_xfb_address(b, 64, .base = buffer);
+
+ nir_def *src = intr->src[0].ssa;
+ nir_component_mask_t mask = nir_component_mask(num_components);
+ nir_def *value = nir_channels(b, src, mask << channel_idx);
+
+ /* Topology dispatch ladder. LIST first (fast path). */
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LIST));
+ {
+ emit_list_store(b, buf, out_count, instance, v, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ /* iter17 Janet Finding 3: gate all non-LIST emission on
+ * output_count > 0. For degenerate input counts (N < min required
+ * for the topology), output_count is 0 and we must emit NO stores
+ * — otherwise N-2 / N-3 / etc. arithmetic underflows in the
+ * eligibility predicates and we falsely fire stores. */
+ nir_push_if(b, nir_ult(b, nir_imm_int(b, 0), out_count));
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
+ {
+ emit_tri_strip(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
+ {
+ emit_line_strip(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_FAN));
+ {
+ emit_tri_fan(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_LIST_ADJ));
+ {
+ emit_line_list_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP_ADJ));
+ {
+ emit_line_strip_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_LIST_ADJ));
+ {
+ emit_tri_list_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ /* TRI_STRIP_ADJ — last case */
+ emit_tri_strip_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL); /* Janet Finding 3: close output_count > 0 guard */
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* Mirror of pan_nir_lower_xfb's lower_xfb: load_vertex_id rewrite +
+ * dispatch store_output through our topology-aware emission. */
+static bool
+lower_xfb_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+ UNUSED void *data)
+{
+ if (intr->intrinsic == nir_intrinsic_load_vertex_id) {
+ b->cursor = nir_instr_remove(&intr->instr);
+ nir_def *repl = nir_iadd(b, nir_load_raw_vertex_id_pan(b),
+ nir_load_raw_vertex_offset_pan(b));
+ nir_def_rewrite_uses(&intr->def, repl);
+ return true;
+ }
+
+ if (intr->intrinsic != nir_intrinsic_store_output)
+ return false;
+
+ bool progress = false;
+ b->cursor = nir_before_instr(&intr->instr);
+
+ /* io_xfb has only out[0,1]; the other 2 channels are in io_xfb2.
+ * Outer loop selects which annotation; inner picks which channel. */
+ for (unsigned i = 0; i < 2; ++i) {
+ nir_io_xfb xfb = i ? nir_intrinsic_io_xfb2(intr)
+ : nir_intrinsic_io_xfb(intr);
+ for (unsigned j = 0; j < 2; ++j) {
+ if (!xfb.out[j].num_components)
+ continue;
+ lower_xfb_output_iter17(b, intr, i * 2 + j, xfb.out[j].num_components,
+ xfb.out[j].buffer, xfb.out[j].offset);
+ progress = true;
+ }
+ }
+
+ if (progress)
+ nir_instr_remove(&intr->instr);
+ return progress;
+}
+
+bool
+panvk_per_arch(nir_lower_xfb)(nir_shader *nir)
+{
+ return nir_shader_intrinsics_pass(
+ nir, lower_xfb_iter17, nir_metadata_control_flow, NULL);
+}
+
+#endif /* PAN_ARCH < 9 */
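The per-vertex scatter in emit_tri_strip can be checked on the host against the usual per-triangle gather. The following is a minimal sketch, not part of the patch: gather_tri_strip, scatter_tri_strip and tri_strip_models_agree are hypothetical helper names, MAX_VERTS is an arbitrary bound, and plain C stands in for the NIR builder calls.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_VERTS 32u
#define MAX_OUT (3u * MAX_VERTS)

/* Gather: triangle i reads {i, i+1, i+2}, with the last two swapped on
 * odd i so that winding stays consistent. */
static void gather_tri_strip(uint32_t n, uint32_t *out)
{
   for (uint32_t i = 0; i + 2 < n; i++) {
      out[3 * i + 0] = i;
      out[3 * i + 1] = i + 1 + (i & 1u);
      out[3 * i + 2] = i + 2 - (i & 1u);
   }
}

/* Scatter: vertex v writes itself into every triangle it belongs to.
 * Bounds are written as additions so nothing underflows when n < 3. */
static void scatter_tri_strip(uint32_t n, uint32_t *out)
{
   for (uint32_t v = 0; v < n; v++) {
      if (v + 2 < n)
         out[3 * v] = v;                   /* prim v, slot 0 */
      if (v >= 1 && v + 1 < n) {
         uint32_t p = v - 1;
         out[3 * p + 1 + (p & 1u)] = v;    /* even p: slot 1, odd p: slot 2 */
      }
      if (v >= 2) {
         uint32_t p = v - 2;
         out[3 * p + 2 - (p & 1u)] = v;    /* even p: slot 2, odd p: slot 1 */
      }
   }
}

/* Returns 1 when both forms produce identical output for n input vertices. */
static int tri_strip_models_agree(uint32_t n)
{
   uint32_t a[MAX_OUT], b[MAX_OUT];

   assert(n <= MAX_VERTS);
   memset(a, 0xff, sizeof(a)); /* 0xffffffff marks "never written" */
   memset(b, 0xff, sizeof(b));
   gather_tri_strip(n, a);
   scatter_tri_strip(n, b);
   return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling tri_strip_models_agree(n) for each n from 0 to MAX_VERTS exercises the degenerate counts as well as the odd/even slot swap; the same pattern extends to the other topologies by swapping in their slot rules.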
@@ -0,0 +1,181 @@
# Maintainer: Markus Fritsche <fritsche.markus@gmail.com>
#
# mesa-panvk-bifrost-video — sibling of mesa-panvk-bifrost (r4) that adds
# VK_KHR_video_decode_h264 on Mali Bifrost SBCs (PAN_ARCH 6/7) backed by
# the SoC's V4L2-stateless hantro VPU (RK3566/RK3568).
#
# Campaign: ~/src/panvk-bifrost-video/ — Phase 4 byte-exact validated
# 2026-05-21 (48/48 BBB display frames match ffmpeg+libva-v4l2-request-
# fourier byte-for-byte on the same hantro). Phase 5 second-model review
# completed; load-bearing findings (output_map OOB, static counter,
# session_init unwind, probe_hantro gate) all applied.
#
# What it does (on top of r4):
# - 0001..0004: inherited from mesa-panvk-bifrost (robustness2/null-
# descriptor, vk1.1/1.2 advertisement, EXT_transform_feedback, XFB
# primitive decomposition) — symlinked from the r4 package directory
# so the patches don't drift between siblings.
# - 0005: VK_KHR_video_queue + VK_KHR_video_decode_queue +
# VK_KHR_video_decode_h264 backed by V4L2-stateless hantro.
# Touches 14 files in src/panfrost/vulkan/; full diff in
# 0005-panvk-bifrost-video-KHR-video-decode-h264.patch.
#
# Co-existence:
# - Installs to /usr/lib/panvk-bifrost-video/ (parallel to r4's
# /usr/lib/panvk-bifrost/). Pick at runtime via VK_ICD_FILENAMES.
# - r4 stays the recommended default for the Chromium-GPU-process
# consumer (no video needed there). Use this package when the
# consumer wants Vulkan video decode (mpv-fourier, ffmpeg-vulkan,
# future Chromium-VulkanVideoDecoder).
#
# Phase 1 limitations to know about (documented in source comments):
# - Single video session per device (active_video singleton)
# - Synchronous decode at record time — no pipelining yet
# - Hardcoded /dev/video1 + /dev/media0 (matches RK3566/68, blocks
# other SoCs without a topology-walk port)
# - Bitstream source buffer assumed HOST_VISIBLE (true on panvk-
# bifrost, would need fallback on other backends)
#
# Build target: arch-aarch64 runner via marfrit-packages Gitea Actions.
# Mesa build is slow (~30-60min on Cortex-A55).
pkgname=mesa-panvk-bifrost-video
_mesaver=26.0.6
pkgver=26.0.6.r5.video1
pkgrel=1
pkgdesc="Patched Mesa libvulkan_panfrost.so adding VK_KHR_video_decode_h264 on Bifrost SBCs (sibling of mesa-panvk-bifrost-r4)"
arch=('aarch64')
url="https://git.reauktion.de/marfrit/panvk-bifrost"
license=('MIT')
depends=(
'mesa' # for shared mesa runtime libs
'libdrm'
'wayland'
'libxcb'
'libx11'
'libxshmfence'
'zlib'
'zstd'
'libelf'
'libffi'
'expat'
'llvm-libs'
'lm_sensors'
)
makedepends=(
'meson'
'ninja'
'glslang'
'python-mako'
'python-packaging'
'wayland-protocols'
'libxrandr'
'xorgproto'
'libdrm'
'llvm'
'libclc'
'spirv-llvm-translator'
'spirv-tools'
'rust-bindgen'
'patch'
)
source=(
"https://archive.mesa3d.org/mesa-${_mesaver}.tar.xz"
"0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch"
"0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch"
"0003-panvk-bifrost-vk-ext-transform-feedback.patch"
"0004-panvk-bifrost-xfb-primitive-decomposition.patch"
"0005-panvk-bifrost-video-KHR-video-decode-h264.patch"
"icd.json"
)
# Mesa tarball checksum matches the sibling r4 package — same upstream version.
sha256sums=(
'SKIP' # mesa tarball — co-trust w/ r4 sibling
'SKIP' # patches are local
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP' # icd.json
)
prepare() {
cd "mesa-${_mesaver}"
# r1+r2: small sed-based edits inherited from r4 (verbatim from the
# sibling PKGBUILD — keep in sync).
sed -i 's|\.KHR_robustness2 = PAN_ARCH >= 10,|.KHR_robustness2 = true,|' src/panfrost/vulkan/panvk_vX_physical_device.c
sed -i 's|\.EXT_robustness2 = PAN_ARCH >= 10,|.EXT_robustness2 = true,|' src/panfrost/vulkan/panvk_vX_physical_device.c
sed -i 's|\.nullDescriptor = PAN_ARCH >= 10,|.nullDescriptor = true,|' src/panfrost/vulkan/panvk_vX_physical_device.c
sed -i 's|bool has_vk1_1 = PAN_ARCH >= 10;|bool has_vk1_1 = true;|' src/panfrost/vulkan/panvk_vX_physical_device.c
sed -i 's|bool has_vk1_2 = PAN_ARCH >= 10;|bool has_vk1_2 = true;|' src/panfrost/vulkan/panvk_vX_physical_device.c
# r3: EXT_transform_feedback for Bifrost.
patch -p1 < "${srcdir}/0003-panvk-bifrost-vk-ext-transform-feedback.patch"
# r4: XFB primitive decomposition NIR pass.
patch -p1 < "${srcdir}/0004-panvk-bifrost-xfb-primitive-decomposition.patch"
# video: VK_KHR_video_decode_h264 via V4L2-hantro.
patch -p1 < "${srcdir}/0005-panvk-bifrost-video-KHR-video-decode-h264.patch"
# Sanity-check r1..r4 (inherited).
grep -q "KHR_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "EXT_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "nullDescriptor = true," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "has_vk1_1 = true;" src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "has_vk1_2 = true;" src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "EXT_transform_feedback = PAN_ARCH < 9," src/panfrost/vulkan/panvk_vX_physical_device.c
test -f src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c
grep -q "panvk_per_arch(nir_lower_xfb)" src/panfrost/vulkan/panvk_vX_shader.c
test -f src/panfrost/vulkan/panvk_vX_xfb_lower.c
# Sanity-check video patch landed.
grep -q "KHR_video_queue = PAN_ARCH < 9 && panvk_v4l2_probe_hantro()" \
src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "PANVK_QUEUE_FAMILY_VIDEO_DECODE" src/panfrost/vulkan/panvk_device.h
test -f src/panfrost/vulkan/panvk_video_decode.c
test -f src/panfrost/vulkan/panvk_video_decode.h
test -f src/panfrost/vulkan/panvk_v4l2.c
test -f src/panfrost/vulkan/panvk_v4l2_h264.c
test -f src/panfrost/vulkan/panvk_v4l2_h264_slice_header.c
test -f src/panfrost/vulkan/panvk_v4l2_h264_slice_header.h
grep -q "panvk_v4l2_h264_slice_header.c" src/panfrost/vulkan/meson.build
grep -q "panvk_video_queue_submit_noop" src/panfrost/vulkan/panvk_vX_device.c
}
build() {
cd "mesa-${_mesaver}"
# Mirror r4's narrow build profile.
meson setup build/ \
--prefix=/usr \
--libdir=lib \
--buildtype=release \
-Dvulkan-drivers=panfrost \
-Dgallium-drivers= \
-Dplatforms=wayland,x11 \
-Dglx=disabled \
-Degl=disabled \
-Dgles1=disabled \
-Dgles2=disabled \
-Dvulkan-layers= \
-Dtools= \
-Dgallium-rusticl=false \
-Dmicrosoft-clc=disabled
meson compile -C build
}
package() {
cd "${srcdir}/mesa-${_mesaver}"
# Co-install path — parallel to r4's /usr/lib/panvk-bifrost/.
install -Dm755 build/src/panfrost/vulkan/libvulkan_panfrost.so \
"$pkgdir/usr/lib/panvk-bifrost-video/libvulkan_panfrost.so"
# ICD JSON pointing at the video build. Opt-in via VK_ICD_FILENAMES;
# NOT in /usr/share/vulkan/icd.d/ so it doesn't override stock or r4.
install -Dm644 "$srcdir/icd.json" \
"$pkgdir/usr/lib/panvk-bifrost-video/icd.json"
}
@@ -0,0 +1,40 @@
# mesa-panvk-bifrost-video
Patched Mesa `libvulkan_panfrost.so` that **adds `VK_KHR_video_decode_h264`** on Mali Bifrost SBCs (PAN_ARCH 6/7, RK3566/RK3568 class hardware), backed by the SoC's V4L2-stateless **hantro** VPU.
This is a **sibling** of [mesa-panvk-bifrost](../mesa-panvk-bifrost/) (the r4 package that exposes Bifrost to Chromium's Vulkan compositor). Pick this one when the consumer wants Vulkan **video decode** in addition; pick r4 for compositor-only.
## Status
Phase 4 byte-exact validated 2026-05-21: 48/48 unique BBB display frames decoded by this package are byte-identical to `ffmpeg+libva-v4l2-request-fourier` running on the same hantro hardware. Phase 5 second-model review completed; all load-bearing findings addressed. First publish via marfrit-packages CI 2026-05-22 (PR #79 merge did not auto-fire Actions; this re-trigger restores the standard build/sign/publish path).
## How to use
```sh
# Co-installs alongside r4 and stock mesa.
sudo pacman -S mesa-panvk-bifrost-video
# Opt in (not on the default loader search path).
export VK_ICD_FILENAMES=/usr/lib/panvk-bifrost-video/icd.json
export PAN_I_WANT_A_BROKEN_VULKAN_DRIVER=1 # mesa-upstream gate
# Run a Vulkan video consumer.
vulkan-video-dec-simple-test -i your.h264 --codec h264 --noPresent --maxFrameCount 50
# or
ffmpeg -hwaccel vulkan -i your.mp4 ...
```
## Phase 1 limitations
Documented in source comments and worth knowing before relying on this in production:
- **Single video session per device.** Concurrent `VkVideoSessionKHR` objects on the same device clobber each other (`active_video` singleton). Sufficient for current single-stream consumers.
- **Synchronous decode at record time.** The full V4L2 ioctl dance runs to completion inside `vkCmdDecodeVideoKHR`. No pipelining. Throughput is bounded by hantro's ~1.16× realtime on 1080p H.264.
- **Hardcoded `/dev/video1` + `/dev/media0`.** Matches RK3566/68 but won't work on other SoCs without a topology-walk port (see `libva-v4l2-request-fourier` for the full version).
- **Bitstream source buffer assumed HOST_VISIBLE.** True on panvk-bifrost (no DEVICE_LOCAL-only memory types exist), but the code silently skips decode if the app bound the buffer to non-host-visible memory.
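
The last limitation can be checked from the application side before the bitstream buffer is bound. A minimal sketch, assuming the app already holds its memory-type table and the buffer's `memoryTypeBits`; `struct mem_type`, `pick_memory_type` and the `MEM_*` macros are hypothetical local stand-ins (the bit values mirror `VkMemoryPropertyFlagBits`), so the snippet compiles without Vulkan headers:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins (assumption): same bit values as
 * VK_MEMORY_PROPERTY_{DEVICE_LOCAL,HOST_VISIBLE,HOST_COHERENT}_BIT. */
#define MEM_DEVICE_LOCAL_BIT  0x1u
#define MEM_HOST_VISIBLE_BIT  0x2u
#define MEM_HOST_COHERENT_BIT 0x4u

/* Stand-in for one entry of the device's memory-type table. */
struct mem_type {
   uint32_t property_flags;
};

/* Return the first memory type index that is allowed by type_bits and
 * carries every flag in `required`, or -1 when none qualifies. */
static int pick_memory_type(const struct mem_type *types, uint32_t count,
                            uint32_t type_bits, uint32_t required)
{
   assert(count <= 32u); /* Vulkan caps the table at 32 memory types */

   for (uint32_t i = 0; i < count; i++) {
      if (!(type_bits & (1u << i)))
         continue; /* the buffer cannot live in this type */
      if ((types[i].property_flags & required) == required)
         return (int)i;
   }
   return -1;
}
```

Passing `MEM_HOST_VISIBLE_BIT` as `required` and treating `-1` as a hard error makes the failure visible in the app instead of relying on the driver's silent skip.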
## Co-existence
- Installs to `/usr/lib/panvk-bifrost-video/` — parallel to r4's `/usr/lib/panvk-bifrost/` and stock `/usr/lib/`.
- Opt-in via `VK_ICD_FILENAMES`; does NOT register itself in `/usr/share/vulkan/icd.d/`.
- Three drivers coexist without conflict; the user picks at runtime which to use.
@@ -0,0 +1,7 @@
{
"ICD": {
"api_version": "1.4.335",
"library_path": "/usr/lib/panvk-bifrost-video/libvulkan_panfrost.so"
},
"file_format_version": "1.0.1"
}
@@ -0,0 +1,629 @@
diff -urN a/src/panfrost/vulkan/meson.build b/src/panfrost/vulkan/meson.build
--- a/src/panfrost/vulkan/meson.build 2026-05-21 14:04:02.529474145 +0200
+++ b/src/panfrost/vulkan/meson.build 2026-05-21 14:04:04.106755486 +0200
@@ -123,6 +123,7 @@
'panvk_vX_nir_lower_input_attachment_loads.c',
'panvk_vX_sampler.c',
'panvk_vX_shader.c',
+ 'panvk_vX_xfb_lower.c',
sha1_h,
]
diff -urN a/src/panfrost/vulkan/panvk_shader.h b/src/panfrost/vulkan/panvk_shader.h
--- a/src/panfrost/vulkan/panvk_shader.h 2026-05-21 14:04:02.525251986 +0200
+++ b/src/panfrost/vulkan/panvk_shader.h 2026-05-21 14:04:04.084251800 +0200
@@ -154,6 +154,8 @@
/* aligned_u64 attribute below inserts the 4-byte alignment gap
* after num_vertices automatically — no explicit pad needed. */
aligned_u64 xfb_address[4]; /* iter13: 4 transform feedback buffer base addresses */
+ uint32_t xfb_topology; /* iter17: panvk_xfb_topology enum value */
+ uint32_t xfb_output_count; /* iter17: per-instance output verts after decomp */
#endif
int32_t first_vertex;
int32_t base_instance;
@@ -569,4 +571,76 @@
struct pan_compute_dim local_size, const void *bin_ptr, size_t bin_size,
struct panvk_shader **shader_out);
+
+#if PAN_ARCH < 9
+/* iter17: encoding for vs.xfb_topology sysval. Maps VkPrimitiveTopology values
+ * we need to distinguish at shader runtime for XFB capture. LIST topologies
+ * use the iter13 single-store fast path; non-LIST need per-vertex decomposition. */
+enum panvk_xfb_topology {
+ PANVK_XFB_TOPO_LIST = 0,
+ PANVK_XFB_TOPO_LINE_STRIP = 1,
+ PANVK_XFB_TOPO_TRI_STRIP = 2,
+ PANVK_XFB_TOPO_TRI_FAN = 3,
+ PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
+ PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
+ PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
+ PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
+};
+
+#include "panvk_macros.h"
+struct nir_shader;
+bool panvk_per_arch(nir_lower_xfb)(struct nir_shader *nir);
+
+/* Map VkPrimitiveTopology to panvk_xfb_topology enum (driver-side helper). */
+static inline uint32_t
+panvk_vk_topology_to_xfb_enum(VkPrimitiveTopology topo)
+{
+ switch (topo) {
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+ return PANVK_XFB_TOPO_LINE_STRIP;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+ return PANVK_XFB_TOPO_TRI_STRIP;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+ return PANVK_XFB_TOPO_TRI_FAN;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_LINE_LIST_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_TRI_LIST_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+ return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
+ case VK_PRIMITIVE_TOPOLOGY_POINT_LIST:
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST:
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST:
+ default:
+ return PANVK_XFB_TOPO_LIST;
+ }
+}
+
+/* Compute the per-instance output vertex count for a given (topology, input count). */
+static inline uint32_t
+panvk_xfb_output_count(VkPrimitiveTopology topo, uint32_t input_count)
+{
+ switch (topo) {
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP:
+ return input_count >= 1 ? 2u * (input_count - 1u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP:
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_FAN:
+ return input_count >= 2 ? 3u * (input_count - 2u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_LIST_WITH_ADJACENCY:
+ return (input_count / 4u) * 2u;
+ case VK_PRIMITIVE_TOPOLOGY_LINE_STRIP_WITH_ADJACENCY:
+ return input_count >= 3 ? 2u * (input_count - 3u) : 0u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST_WITH_ADJACENCY:
+ return (input_count / 6u) * 3u;
+ case VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP_WITH_ADJACENCY:
+ return input_count >= 6 ? 3u * (input_count / 2u - 2u) : 0u;
+ default:
+ return input_count; /* LIST topologies: 1:1 mapping */
+ }
+}
+#endif
+
+
#endif
diff -urN a/src/panfrost/vulkan/panvk_vX_cmd_draw.c b/src/panfrost/vulkan/panvk_vX_cmd_draw.c
--- a/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-21 14:04:02.528576354 +0200
+++ b/src/panfrost/vulkan/panvk_vX_cmd_draw.c 2026-05-21 14:04:04.091357598 +0200
@@ -727,6 +727,20 @@
/* iter13: VK_EXT_transform_feedback sysvals — always set (per draw),
* reflect bound XFB state. set_gfx_sysval is a no-op if value unchanged. */
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, info->vertex.count);
+
+ /* iter17: XFB primitive-decomposition sysvals.
+ * xfb_topology = enum value for the current bound topology.
+ * xfb_output_count = per-instance output vertex count after decomposition.
+ * For LIST topologies, output_count == input vertex count and the shader
+ * takes the iter13 single-store fast path. */
+ {
+ VkPrimitiveTopology vk_topo =
+ cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;
+ uint32_t topo_enum = panvk_vk_topology_to_xfb_enum(vk_topo);
+ uint32_t out_count = panvk_xfb_output_count(vk_topo, info->vertex.count);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_topology, topo_enum);
+ set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_output_count, out_count);
+ }
{
const struct panvk_cmd_graphics_state *_gfx = &cmdbuf->state.gfx;
/* iter13: default each XFB buffer address to PAN_SHADER_OOB_ADDRESS
diff -urN a/src/panfrost/vulkan/panvk_vX_shader.c b/src/panfrost/vulkan/panvk_vX_shader.c
--- a/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-21 14:04:02.527576494 +0200
+++ b/src/panfrost/vulkan/panvk_vX_shader.c 2026-05-21 14:04:04.098356619 +0200
@@ -895,7 +895,10 @@
nir->info.has_transform_feedback_varyings) {
NIR_PASS(_, nir, nir_opt_constant_folding);
NIR_PASS(_, nir, nir_io_add_intrinsic_xfb_info);
- NIR_PASS(_, nir, pan_nir_lower_xfb);
+ /* iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+ * primitive decomposition for non-LIST topologies. Single-store LIST
+ * fast path matches iter13 behavior. */
+ NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb));
}
#endif
}
diff -urN a/src/panfrost/vulkan/panvk_vX_xfb_lower.c b/src/panfrost/vulkan/panvk_vX_xfb_lower.c
--- a/src/panfrost/vulkan/panvk_vX_xfb_lower.c 1970-01-01 01:00:00.000000000 +0100
+++ b/src/panfrost/vulkan/panvk_vX_xfb_lower.c 2026-05-21 14:04:04.115354242 +0200
@@ -0,0 +1,486 @@
+/*
+ * Copyright © 2026 mfritsche / claude-noether
+ * SPDX-License-Identifier: MIT
+ *
+ * iter17: panvk-specific replacement for pan_nir_lower_xfb that handles
+ * primitive decomposition for transform_feedback on non-LIST topologies
+ * (TRIANGLE_STRIP/FAN, LINE_STRIP, *_WITH_ADJACENCY).
+ *
+ * Approach: emit a topology dispatch at the start of each store_output
+ * lowering. The shader reads vs.xfb_topology sysval at runtime and branches
+ * into per-topology emission logic. For each affected topology, the lowered
+ * code emits guarded conditional stores — one per primitive this vertex
+ * contributes to, computing the output buffer position via primitive index
+ * and slot within the decomposed primitive.
+ *
+ * For LIST topologies (POINT/LINE/TRIANGLE LIST), takes a fast path that
+ * matches iter13's single-store behavior.
+ *
+ * For TRIANGLE_FAN, the central vertex (v=0) contributes to ALL primitives
+ * as slot 2 — handled via a NIR loop bounded by num_vertices.
+ *
+ * See ~/src/panvk-bifrost/iter17/phase{0,1,2}_*.md for full design context.
+ */
+
+#include "panvk_macros.h"
+
+#if PAN_ARCH < 9
+
+#include "panvk_shader.h"
+
+#include "compiler/nir/nir_builder.h"
+#include "pan_nir.h"
+
+#include <vulkan/vulkan_core.h>
+
+/* ----- Address arithmetic ----- */
+
+static nir_def *
+xfb_store_addr(nir_builder *b, nir_def *buf, nir_def *out_idx,
+ uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *byte_off = nir_iadd_imm(b,
+ nir_imul_imm(b, out_idx, stride), offset_bytes);
+ return nir_iadd(b, buf, nir_u2u64(b, byte_off));
+}
+
+static void
+emit_list_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+ nir_def *instance_id, nir_def *raw_vid, nir_def *value,
+ uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count), raw_vid);
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+}
+
+static void
+emit_prim_store(nir_builder *b, nir_def *buf, nir_def *output_count,
+ nir_def *instance_id, nir_def *eligible,
+ nir_def *prim_idx, nir_def *slot,
+ uint32_t verts_per_prim,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_push_if(b, eligible);
+ {
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count),
+ nir_iadd(b, nir_imul_imm(b, prim_idx, verts_per_prim), slot));
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* ----- Per-topology emission ----- */
+
+/* TRIANGLE_STRIP: vertex v contributes to prims v, v-1, v-2 (per eligibility). */
+static void
+emit_tri_strip(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+ /* Prim v, slot 0: v < N-2 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v, Nm2),
+ v, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+ /* Prim v-1, slot = 1 if prim even else 2: 1 <= v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_iadd_imm(b, parity, 1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot = 2 if prim even else 1: 2 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+}
+
+/* LINE_STRIP: vertex v contributes to prim v slot 0 + prim v-1 slot 1. */
+static void
+emit_line_strip(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+
+ /* Prim v, slot 0: v < N-1 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v, Nm1),
+ v, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+ /* Prim v-1, slot 1: 1 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_FAN: prim p emits {p+1, p+2, 0}.
+ * vertex v=0: contributes to ALL prims as slot 2 (loop required)
+ * vertex v>=1: contributes to prim v-1 as slot 0 (if 1 <= v <= N-2)
+ * vertex v>=2: contributes to prim v-2 as slot 1 (if 2 <= v <= N-1)
+ */
+static void
+emit_tri_fan(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+ /* Prim v-1, slot 0: 1 <= v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot 1: 2 <= v < N */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, N));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 3, value, stride, offset_bytes);
+ }
+
+ /* Central vertex (v == 0): loop over all prims, write to slot 2. */
+ nir_push_if(b, nir_ieq_imm(b, v, 0));
+ {
+ nir_variable *p_var = nir_local_variable_create(b->impl,
+ glsl_uint_type(), "fan_p");
+ nir_store_var(b, p_var, nir_imm_int(b, 0), 0x1);
+ nir_push_loop(b);
+ {
+ nir_def *p = nir_load_var(b, p_var);
+ nir_push_if(b, nir_uge(b, p, Nm2));
+ {
+ nir_jump(b, nir_jump_break);
+ }
+ nir_pop_if(b, NULL);
+
+ nir_def *out_idx = nir_iadd(b,
+ nir_imul(b, instance_id, output_count),
+ nir_iadd_imm(b, nir_imul_imm(b, p, 3), 2));
+ nir_def *addr = xfb_store_addr(b, buf, out_idx, stride, offset_bytes);
+ nir_store_global(b, value, addr);
+
+ nir_store_var(b, p_var, nir_iadd_imm(b, p, 1), 0x1);
+ }
+ nir_pop_loop(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* LINE_LIST_WITH_ADJACENCY: 4-vertex groups [4i..4i+3]; output {4i+1, 4i+2}.
+ * v contributes if v%4 == 1: prim v/4 slot 0
+ * v contributes if v%4 == 2: prim v/4 slot 1
+ */
+static void
+emit_line_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ (void)N; /* eligibility is mod-based, not range-based */
+ nir_def *vmod4 = nir_iand_imm(b, v, 3u);
+ nir_def *prim = nir_ushr_imm(b, v, 2); /* v / 4 */
+
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ieq_imm(b, vmod4, 1),
+ prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ieq_imm(b, vmod4, 2),
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+}
+
+/* LINE_STRIP_WITH_ADJACENCY: prim p (0 <= p <= N-4) emits {p+1, p+2}.
+ * v contributes to prim v-1 slot 0 (1 <= v <= N-3)
+ * v contributes to prim v-2 slot 1 (2 <= v <= N-2)
+ */
+static void
+emit_line_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ nir_def *Nm1 = nir_iadd_imm(b, N, -1);
+ nir_def *Nm2 = nir_iadd_imm(b, N, -2);
+
+ /* Prim v-1, slot 0: 1 <= v <= N-3 ⇔ v >= 1 AND v < N-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -1);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 1)),
+ nir_ult(b, v, Nm2));
+ /* v == N-2 would address prim N-3, one past the last primitive. */
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 0), 2, value, stride, offset_bytes);
+ }
+
+ /* Prim v-2, slot 1: 2 <= v <= N-2 ⇔ v >= 2 AND v < N-1 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v, -2);
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, v, Nm1));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, nir_imm_int(b, 1), 2, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_LIST_WITH_ADJACENCY: 6-vertex groups; output {6i, 6i+2, 6i+4}.
+ * v contributes if v%6 == 0: prim v/6 slot 0
+ * v contributes if v%6 == 2: prim v/6 slot 1
+ * v contributes if v%6 == 4: prim v/6 slot 2
+ */
+static void
+emit_tri_list_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ (void)N;
+ nir_def *vmod6 = nir_umod_imm(b, v, 6);
+ nir_def *prim = nir_udiv_imm(b, v, 6);
+
+ for (uint32_t slot = 0; slot < 3; slot++) {
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ieq_imm(b, vmod6, slot * 2),
+ prim, nir_imm_int(b, slot), 3, value, stride, offset_bytes);
+ }
+}
+
+/* TRIANGLE_STRIP_WITH_ADJACENCY: prim i emits:
+ * even i: {2i, 2i+2, 2i+4} (slots 0, 1, 2 ← input indices 2i, 2i+2, 2i+4)
+ * odd i: {2i, 2i+4, 2i+2} (slots 0, 1, 2 ← input indices 2i, 2i+4, 2i+2)
+ *
+ * Only EVEN input vertices contribute (since all output indices are 2*something).
+ * For even input v:
+ * prim v/2 slot 0 (always, if v/2 < N/2-2)
+ * prim (v-2)/2 slot 1 if (v-2)/2 even, slot 2 if odd (when v >= 2)
+ * prim (v-4)/2 slot 2 if (v-4)/2 even, slot 1 if odd (when v >= 4)
+ */
+static void
+emit_tri_strip_adj(nir_builder *b, nir_def *v, nir_def *N,
+ nir_def *buf, nir_def *output_count, nir_def *instance_id,
+ nir_def *value, uint16_t stride, uint16_t offset_bytes)
+{
+ /* Bail for odd input vertices — they never contribute. */
+ nir_def *v_is_even = nir_ieq_imm(b, nir_iand_imm(b, v, 1u), 0);
+ nir_push_if(b, v_is_even);
+ {
+ nir_def *N_half = nir_ushr_imm(b, N, 1);
+ nir_def *max_prim = nir_iadd_imm(b, N_half, -2); /* N/2 - 2 */
+ nir_def *v_half = nir_ushr_imm(b, v, 1);
+
+ /* Prim v/2 slot 0: v/2 < N/2 - 2 */
+ emit_prim_store(b, buf, output_count, instance_id,
+ nir_ult(b, v_half, max_prim),
+ v_half, nir_imm_int(b, 0), 3, value, stride, offset_bytes);
+
+ /* Prim (v-2)/2 = v/2 - 1: v >= 2 AND prim < N/2-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v_half, -1);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_iadd_imm(b, parity, 1); /* even→1, odd→2 */
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 2)),
+ nir_ult(b, prim, max_prim));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+
+ /* Prim (v-4)/2 = v/2 - 2: v >= 4 AND prim < N/2-2 */
+ {
+ nir_def *prim = nir_iadd_imm(b, v_half, -2);
+ nir_def *parity = nir_iand_imm(b, prim, 1u);
+ nir_def *slot = nir_isub(b, nir_imm_int(b, 2), parity); /* even→2, odd→1 */
+ nir_def *eligible = nir_iand(b,
+ nir_uge(b, v, nir_imm_int(b, 4)),
+ nir_ult(b, prim, max_prim));
+ emit_prim_store(b, buf, output_count, instance_id, eligible,
+ prim, slot, 3, value, stride, offset_bytes);
+ }
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* ----- Main lowering: per store_output XFB channel ----- */
+
+static void
+lower_xfb_output_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+ unsigned channel_idx, unsigned num_components,
+ unsigned buffer, unsigned offset_words)
+{
+ assert(buffer < MAX_XFB_BUFFERS);
+ assert(nir_intrinsic_component(intr) == 0);
+
+ uint16_t stride = b->shader->info.xfb_stride[buffer] * 4;
+ assert(stride != 0);
+ uint16_t offset_bytes = offset_words * 4;
+
+ BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_VERTEX_ID_ZERO_BASE);
+ BITSET_SET(b->shader->info.system_values_read, SYSTEM_VALUE_INSTANCE_ID);
+
+ nir_def *topology = load_sysval(b, graphics, 32, vs.xfb_topology);
+ nir_def *out_count = load_sysval(b, graphics, 32, vs.xfb_output_count);
+ nir_def *N = nir_load_num_vertices(b);
+ nir_def *v = nir_load_raw_vertex_id_pan(b);
+ nir_def *instance = nir_load_instance_id(b);
+ nir_def *buf = nir_load_xfb_address(b, 64, .base = buffer);
+
+ nir_def *src = intr->src[0].ssa;
+ nir_component_mask_t mask = nir_component_mask(num_components);
+ nir_def *value = nir_channels(b, src, mask << channel_idx);
+
+ /* Topology dispatch ladder. LIST first (fast path). */
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LIST));
+ {
+ emit_list_store(b, buf, out_count, instance, v, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ /* iter17 Janet Finding 3: gate all non-LIST emission on
+ * output_count > 0. For degenerate input counts (N < min required
+ * for the topology), output_count is 0 and we must emit NO stores
+ * — otherwise N-2 / N-3 / etc. arithmetic underflows in the
+ * eligibility predicates and we falsely fire stores. */
+ nir_push_if(b, nir_ult(b, nir_imm_int(b, 0), out_count));
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
+ {
+ emit_tri_strip(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
+ {
+ emit_line_strip(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_FAN));
+ {
+ emit_tri_fan(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_LIST_ADJ));
+ {
+ emit_line_list_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP_ADJ));
+ {
+ emit_line_strip_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_LIST_ADJ));
+ {
+ emit_tri_list_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_push_else(b, NULL);
+ {
+ /* TRI_STRIP_ADJ — last case */
+ emit_tri_strip_adj(b, v, N, buf, out_count, instance, value,
+ stride, offset_bytes);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL);
+ }
+ nir_pop_if(b, NULL); /* Janet Finding 3: close output_count > 0 guard */
+ }
+ nir_pop_if(b, NULL);
+}
+
+/* Mirror of pan_nir_lower_xfb's lower_xfb: load_vertex_id rewrite +
+ * dispatch store_output through our topology-aware emission. */
+static bool
+lower_xfb_iter17(nir_builder *b, nir_intrinsic_instr *intr,
+ UNUSED void *data)
+{
+ if (intr->intrinsic == nir_intrinsic_load_vertex_id) {
+ b->cursor = nir_instr_remove(&intr->instr);
+ nir_def *repl = nir_iadd(b, nir_load_raw_vertex_id_pan(b),
+ nir_load_raw_vertex_offset_pan(b));
+ nir_def_rewrite_uses(&intr->def, repl);
+ return true;
+ }
+
+ if (intr->intrinsic != nir_intrinsic_store_output)
+ return false;
+
+ bool progress = false;
+ b->cursor = nir_before_instr(&intr->instr);
+
+ /* io_xfb has only out[0,1]; the other 2 channels are in io_xfb2.
+ * Outer loop selects which annotation; inner picks which channel. */
+ for (unsigned i = 0; i < 2; ++i) {
+ nir_io_xfb xfb = i ? nir_intrinsic_io_xfb2(intr)
+ : nir_intrinsic_io_xfb(intr);
+ for (unsigned j = 0; j < 2; ++j) {
+ if (!xfb.out[j].num_components)
+ continue;
+ lower_xfb_output_iter17(b, intr, i * 2 + j, xfb.out[j].num_components,
+ xfb.out[j].buffer, xfb.out[j].offset);
+ progress = true;
+ }
+ }
+
+ if (progress)
+ nir_instr_remove(&intr->instr);
+ return progress;
+}
+
+bool
+panvk_per_arch(nir_lower_xfb)(nir_shader *nir)
+{
+ return nir_shader_intrinsics_pass(
+ nir, lower_xfb_iter17, nir_metadata_control_flow, NULL);
+}
+
+#endif /* PAN_ARCH < 9 */
@@ -0,0 +1,50 @@
From: marfrit-packages noether <claude-noether@reauktion.de>
Subject: [PATCH] panvk: report fragmentStoresAndAtomics = true on Bifrost
Backports Mesa main's unconditional advertisement of
fragmentStoresAndAtomics for panvk (snapshot ref: src/panfrost/vulkan/
panvk_vX_physical_device.c at commit-time 2026-05-06; the line reads
`.fragmentStoresAndAtomics = true,` on main with no PAN_ARCH gate).
Motivation: Chromium Dawn's WebGPU initializer in
third_party/dawn/src/dawn/native/vulkan/PhysicalDeviceVk.cpp:250
unconditionally rejects any Vulkan adapter that doesn't advertise this
feature, causing Dawn to fall back to the SwiftShader CPU adapter
on PineTab2 / RK3566 / Mali-G52 r1 MC1 (PAN_ARCH 7). With this patch the
device advertises true, satisfying Dawn's gate. Tracked at
https://git.reauktion.de/marfrit/panvk-bifrost/issues/2.
The disjunction with `instance->force_enable_shader_atomics` is kept
for documentation only: `true || X` always evaluates to true, so the
option can no longer change the result. Leaving it in place keeps the
DRI option `pan_force_enable_shader_atomics` visibly wired, so future
rebases or downstream debugging can see the link to the runtime knob.
Caveat: the existing DRI option's description in src/util/driconf.h
still labels this as "may not work reliably and is for debug purposes
only". Mesa main's choice to ship it as default-on for all panvk
architectures (including Bifrost, which is non-conformant per the
PAN_I_WANT_A_BROKEN_VULKAN_DRIVER gate) reflects an upstream judgment
that the practical risk is acceptable. Verify-before-ship for this
package: dEQP-VK.glsl.atomic_operations.* + dEQP-VK.image.store.*
deltas vs the r4 baseline must show no new fails. Pass counts may rise
(tests that previously reported NotSupported now run); the load-bearing
line is the Failed column staying at zero.
---
src/panfrost/vulkan/panvk_vX_physical_device.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
--- a/src/panfrost/vulkan/panvk_vX_physical_device.c
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -280,8 +280,7 @@
.vertexPipelineStoresAndAtomics =
(PAN_ARCH >= 13 && instance->enable_vertex_pipeline_stores_atomics) ||
instance->force_enable_shader_atomics,
- .fragmentStoresAndAtomics =
- (PAN_ARCH >= 10) || instance->force_enable_shader_atomics,
+ .fragmentStoresAndAtomics = true || instance->force_enable_shader_atomics,
.shaderTessellationAndGeometryPointSize = false,
.shaderImageGatherExtended = true,
.shaderStorageImageExtendedFormats = true,
@@ -0,0 +1,51 @@
From: marfrit-packages noether <claude-noether@reauktion.de>
Subject: [PATCH] panvk: advertise VK_EXT_legacy_dithering on Bifrost
Backports Mesa main's flip — vanilla 26.0.6 doesn't have the extension
in the panvk advertisement list; main does (line 172 / 647 on snapshot
617da94, 2026-05-06).
VK_EXT_legacy_dithering exposes the classic OpenGL-style dithering
behavior to Vulkan apps. Pure-software composition; no new HW path.
ARM's own libmali driver release r51p0 (BXODROIDN2PL, Aug 2024) lists
this extension in its Vulkan implementation for ODROID-N2 boards, which
use the same Mali-G52 architecture family. This confirms that ARM ships
it for Mali-G52-class hardware.
Consumer benefit: dithering matters for low-bit-depth framebuffers
(RGB565 / RGB5A1 — common on portable / battery-saving renders)
where banding is visible. DXVK / vkd3d-proton both opt in when
available.
Verify-before-ship: vulkaninfo lists the extension and
VkPhysicalDeviceLegacyDitheringFeaturesEXT.legacyDithering == true.
Cross-refs:
- marfrit/panvk-bifrost research/r6_r7_mali_g52_feature_audit_2026-05-24.md
- ARM blob r51p0 strings dump (in-blob extension confirmed)
---
src/panfrost/vulkan/panvk_vX_physical_device.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/src/panfrost/vulkan/panvk_vX_physical_device.c b/src/panfrost/vulkan/panvk_vX_physical_device.c
--- a/src/panfrost/vulkan/panvk_vX_physical_device.c
+++ b/src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -156,6 +156,7 @@
.EXT_image_drm_format_modifier = true,
.EXT_image_robustness = true,
.EXT_index_type_uint8 = true,
+ .EXT_legacy_dithering = true,
.EXT_line_rasterization = true,
.EXT_load_store_op_none = true,
.EXT_non_seamless_cube_map = true,
@@ -552,6 +553,9 @@
/* VK_EXT_multisampled_render_to_single_sampled */
.multisampledRenderToSingleSampled = true,
+
+ /* VK_EXT_legacy_dithering */
+ .legacyDithering = true,
};
}
@@ -0,0 +1,103 @@
From: marfrit-packages noether <claude-noether@reauktion.de>
Subject: [PATCH] panvk-bifrost: fix XFB store channel-extract for packed varyings
iter19 — fixes a reliable SIGSEGV during vkCreateGraphicsPipeline on any
shader that uses XFB-bound varyings declared with non-zero `layout
(component=N)` qualifiers. Surfaced by
dEQP-VK.transform_feedback.simple.holes_vert; backtrace lands 11 frames
into libvulkan_panfrost.so called from `vkt::TransformFeedback::
TransformFeedbackHolesInstance::iterate`.
Root cause: `lower_xfb_output_iter17` (and upstream `lower_xfb_output`,
which carries a `// TODO` on the same assertion) computes the source-
channel mask as `mask << channel_idx`, where `channel_idx` is the
varying-location component (0..3) but `src` only contains channels for
the source-side range starting at `nir_intrinsic_component(intr)`. For
`flat out float vegeta` declared with `component=2`, NIR emits
`store_output src=<vec1>, component=2`, and the lowering computes
`mask << 2` against a single-component src — out-of-range; the
resulting malformed nir_def then segfaults inside downstream NIR
constant-folding (`nir_constant_expressions.c::evaluate_*`).
The assertion `assert(nir_intrinsic_component(intr) == 0)` was inherited
from upstream `pan_nir_lower_xfb.c` as a documented `// TODO`; release
builds (-DNDEBUG) elide it. The fix translates `channel_idx` to the
source-channel space by subtracting `nir_intrinsic_component(intr)`
before shifting the mask, and replaces the elided asserts with explicit
release-mode guards (the patch closes the same release-mode-elision
class as the original bug).
Verified on PineTab2 (Mali-G52 r1 MC1, PAN_ARCH 7) against vulkan-cts
1.3.10.0:
- holes_vert / holes_extra_draw_vert no longer SIGSEGV (now Fail on
color-check; that is a separate iter20 finding — the rasterized
varying gets removed alongside the XFB-bound one).
- basic_*: 36/36 Pass. depth_clip_*: 1 Pass + 4 NotSupported.
lines_or_triangles*: 16 NotSupported. 0 Fail across the full set.
- holes_geom / holes_extra_draw_geom remain NotSupported
(geometryShader not on G52) — unchanged.
Caveat: max_output_components_64/_128/_256 were never reached on the
r5 sweep (watchdog killed transform_feedback after the holes_vert
crash). With this fix in place, those tests now run and surface
*their own pre-existing* coredumps — confirmed on shipped r6 baseline
too. They are NOT regressions from this patch; they are latent crashes
unmasked by it. iter20+ territory.
Phase 5 (2nd-model) review: APPROVE WITH CHANGES (non-blocking).
Changes applied: release-mode defensive guards on both preconditions
plus a dispatcher-side comment clarifying the i*2+j semantics.
Cross-refs:
- iter19/phase{0,1,2,3}_holes_vert*.md in panvk-bifrost repo
---
src/panfrost/vulkan/panvk_vX_xfb_lower.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/src/panfrost/vulkan/panvk_vX_xfb_lower.c b/src/panfrost/vulkan/panvk_vX_xfb_lower.c
@@ -339,7 +339,20 @@
unsigned buffer, unsigned offset_words)
{
assert(buffer < MAX_XFB_BUFFERS);
- assert(nir_intrinsic_component(intr) == 0);
+
+ /* iter19: nir_intrinsic_component(intr) is the source-channel base —
+ * for a packed varying like `layout (location=0, component=2) flat out
+ * float vegeta`, NIR emits store_output with component=2 and a single-
+ * component src. The XFB iteration index `channel_idx` (0..3) is the
+ * varying-location component, not the source channel. Translate by
+ * subtracting the base before shifting the mask. Fixes the long-
+ * standing `assert(nir_intrinsic_component(intr) == 0) // TODO` in
+ * upstream pan_nir_lower_xfb that surfaces on holes_vert. */
+ const unsigned base_comp = nir_intrinsic_component(intr);
+ /* Defensive against release-build elision: this is precisely the
+ * bug class the patch is fixing, so don't re-introduce it. */
+ if (channel_idx < base_comp)
+ return;
uint16_t stride = b->shader->info.xfb_stride[buffer] * 4;
assert(stride != 0);
@@ -357,7 +370,11 @@
nir_def *src = intr->src[0].ssa;
nir_component_mask_t mask = nir_component_mask(num_components);
- nir_def *value = nir_channels(b, src, mask << channel_idx);
+ const unsigned src_channel = channel_idx - base_comp;
+ /* Same defensive class as the channel_idx >= base_comp guard above. */
+ if (src_channel + num_components > src->num_components)
+ return;
+ nir_def *value = nir_channels(b, src, mask << src_channel);
/* Topology dispatch ladder. LIST first (fast path). */
nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LIST));
@@ -465,6 +482,9 @@
for (unsigned j = 0; j < 2; ++j) {
if (!xfb.out[j].num_components)
continue;
+ /* `i*2+j` is the varying-location component (0..3) — io_xfb covers
+ * slots 0..1, io_xfb2 covers 2..3. The leaf translates this into
+ * a source-channel index by subtracting nir_intrinsic_component(intr). */
lower_xfb_output_iter17(b, intr, i * 2 + j, xfb.out[j].num_components,
xfb.out[j].buffer, xfb.out[j].offset);
progress = true;
@@ -30,11 +30,11 @@
pkgname=mesa-panvk-bifrost
_mesaver=26.0.6
-pkgver=26.0.6.r3
+pkgver=26.0.6.r7
pkgrel=1
pkgdesc="Patched Mesa libvulkan_panfrost.so exposing Bifrost-gen Mali to Vulkan apps (panvk-bifrost campaign)"
arch=('aarch64')
-url="https://github.com/marfrit/panvk-bifrost"
+url="https://git.reauktion.de/marfrit/panvk-bifrost"
license=('MIT')
# We co-install at /usr/lib/panvk-bifrost/ so no conflicts with stock mesa.
@@ -80,6 +80,10 @@ source=(
"0001-panvk-expose-robustness2-nullDescriptor-bifrost.patch"
"0002-panvk-expose-vulkan-1.1-1.2-on-bifrost.patch"
"0003-panvk-bifrost-vk-ext-transform-feedback.patch"
"0004-panvk-bifrost-xfb-primitive-decomposition.patch"
"0005-panvk-bifrost-fragment-stores-atomics.patch"
"0006-panvk-bifrost-legacy-dithering.patch"
"0007-panvk-bifrost-xfb-component-base-fix.patch"
"brave-vulkan"
"icd.json"
)
@@ -90,6 +94,10 @@ sha256sums=(
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
'SKIP'
)
prepare() {
@@ -116,6 +124,46 @@ prepare() {
# reports "Hardware accelerated" across the board for the affected paths).
patch -p1 < "${srcdir}/0003-panvk-bifrost-vk-ext-transform-feedback.patch"
# iter17: XFB primitive decomposition for non-LIST topologies (TRI_STRIP,
# TRI_FAN, LINE_STRIP, *_WITH_ADJACENCY). Replacement panvk-specific
# NIR pass (panvk_per_arch(nir_lower_xfb)) substituted for upstream
# pan_nir_lower_xfb. Closes the 162 dEQP-VK winding_* failures from
# iter15 (958 P / 81 F / 0 Crash on full XFB CTS — remaining 81 fails
# are by-design resume_* tests, transformFeedbackDraw=false).
# Phase-doc context: ~/src/panvk-bifrost/iter17/phase{0,1,2,4,5,6,8}_*.md.
patch -p1 < "${srcdir}/0004-panvk-bifrost-xfb-primitive-decomposition.patch"
# r5 (2026-05-23): advertise .fragmentStoresAndAtomics = true on Bifrost
# to satisfy Chromium Dawn's WebGPU init gate
# (third_party/dawn/src/dawn/native/vulkan/PhysicalDeviceVk.cpp:250).
# Backports Mesa main's unconditional flip (same line as on main as of
# 2026-05-06). Disjunction with instance->force_enable_shader_atomics
# is preserved as a documented kill-switch even though the compiler
# folds it away. Closes marfrit/panvk-bifrost#2.
# Verify-before-ship: dEQP-VK.glsl.atomic_operations.* and
# dEQP-VK.image.store.* show no new Failed vs r4 baseline.
patch -p1 < "${srcdir}/0005-panvk-bifrost-fragment-stores-atomics.patch"
# r6 (2026-05-25): advertise VK_EXT_legacy_dithering. Backports Mesa
# main's unconditional flip. Pure-software composition; vk_render_pass
# already gates on enabled_features.legacyDithering and panvk_vX_blend
# + pan_format already plumb the dithered BLEND descriptor (BFMT2 table
# has MALI_BLEND_AU encodings for RGB565/RGB5A1/RGBA4/RGB10A2 on
# PAN_ARCH 7). Closes the EXT_legacy_dithering gap surfaced by
# marfrit/panvk-bifrost research/r6_r7_*. ARM blob r51p0 confirms the
# extension as Mali-G52-architecture supported.
patch -p1 < "${srcdir}/0006-panvk-bifrost-legacy-dithering.patch"
# r7 (2026-05-25): XFB store channel-extract fix for packed varyings.
# Eliminates a reliable SIGSEGV in vkCreateGraphicsPipeline whenever
# an XFB-bound vertex output is declared with non-zero
# `layout (component=N)`. Surfaced by dEQP-VK.transform_feedback.
# simple.holes_vert (now Fails on color-check rather than crashing;
# the color-check residual is a separate iter20 finding).
# Phase-doc context: ~/src/panvk-bifrost/iter19/phase{0,1,2,3}_*.md.
# Phase 5 reviewed; release-mode-elision defensive guards applied.
patch -p1 < "${srcdir}/0007-panvk-bifrost-xfb-component-base-fix.patch"
# Sanity-check the patches landed.
grep -q "KHR_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "EXT_robustness2 = true," src/panfrost/vulkan/panvk_vX_physical_device.c
@@ -124,8 +172,20 @@ prepare() {
grep -q "has_vk1_2 = true;" src/panfrost/vulkan/panvk_vX_physical_device.c
# iter13 sanity:
grep -q "EXT_transform_feedback = PAN_ARCH < 9," src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "pan_nir_lower_xfb" src/panfrost/vulkan/panvk_vX_shader.c
test -f src/panfrost/vulkan/jm/panvk_vX_cmd_xfb.c
# iter17 sanity: pan_nir_lower_xfb call site has been replaced; new file present.
grep -q "panvk_per_arch(nir_lower_xfb)" src/panfrost/vulkan/panvk_vX_shader.c
# r5 sanity: fragmentStoresAndAtomics = true patch landed
grep -q "fragmentStoresAndAtomics = true ||" src/panfrost/vulkan/panvk_vX_physical_device.c
# r6 sanity: VK_EXT_legacy_dithering advertised
grep -q '\.EXT_legacy_dithering = true,' src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q '\.legacyDithering = true,' src/panfrost/vulkan/panvk_vX_physical_device.c
grep -q "xfb_topology" src/panfrost/vulkan/panvk_shader.h
grep -q "panvk_xfb_topology" src/panfrost/vulkan/panvk_shader.h
test -f src/panfrost/vulkan/panvk_vX_xfb_lower.c
# r7 sanity: XFB channel-base correction landed
grep -q "iter19: nir_intrinsic_component(intr) is the source-channel base" src/panfrost/vulkan/panvk_vX_xfb_lower.c
grep -q "mask << src_channel" src/panfrost/vulkan/panvk_vX_xfb_lower.c
}
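The sanity checks in `prepare()` stop at the first failing `grep -q` without saying which marker was missing. A small wrapper can name it. This is a sketch only: `require_marker` is a hypothetical helper, not part of the PKGBUILD, and it matches fixed strings (`grep -F`), so regex-escaped patterns such as `'\.legacyDithering = true,'` would be passed unescaped.

```bash
#!/bin/bash
# Hypothetical helper: like `grep -q`, but reports the missing marker on stderr.
# Args: label pattern file
require_marker() {
    local label=$1 pattern=$2 file=$3
    if ! grep -qF -- "$pattern" "$file"; then
        printf 'sanity check failed (%s): "%s" not found in %s\n' \
            "$label" "$pattern" "$file" >&2
        return 1
    fi
}

# Demo against a throwaway file standing in for panvk_vX_xfb_lower.c.
demo=$(mktemp)
printf '%s\n' 'nir_def *value = nir_channels(b, src, mask << src_channel);' > "$demo"
require_marker r7 'mask << src_channel' "$demo" && echo "r7 marker present"
require_marker r6 '.legacyDithering = true,' "$demo" || echo "r6 marker missing"
rm -f "$demo"
```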
build() {
Vendored Executable
+85
@@ -0,0 +1,85 @@
#!/bin/bash
# Build aish_<ver>_all.deb from this directory using dpkg-deb directly.
# Run from inside the runner container, which has dpkg installed.
#
# Matches the lmcp build-deb.sh pattern: no dh/debhelper, no Build-Depends
# beyond `dpkg`, structurally a normal apt package (Architecture: all).
set -euo pipefail
PKGVER=0.1.0
UPSTREAM_TAG=v0.1.0
PKGREL=1
AISH_TARBALL_SHA256=9ebc3939e028832e39391ae33efacb5ec9bcd99d123cbc8ca1cd6ca9a640b5b5
HERE=$(dirname "$(readlink -f "$0")")
# Reproducible build: pin all file mtimes + ar member timestamps to a fixed
# epoch tied to this packaging release (aish v0.1.0 — 2026-05-25 00:00 UTC).
# Without this, repeat builds produce different byte streams and reprepro
# refuses re-includes with "size expected: X, got: Y".
export SOURCE_DATE_EPOCH=1779667200
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"
curl --connect-timeout 10 --max-time 600 --retry 3 --retry-delay 5 -sSLfo aish.tar.gz \
"https://git.reauktion.de/marfrit/aish/archive/${UPSTREAM_TAG}.tar.gz"
echo "$AISH_TARBALL_SHA256  aish.tar.gz" | sha256sum -c
tar xzf aish.tar.gz
ROOT="$work/pkgroot"
LIBDIR="$ROOT/usr/share/lua/5.1/aish"
mkdir -p "$ROOT/DEBIAN" \
"$LIBDIR/ffi" \
"$LIBDIR/vendor" \
"$ROOT/usr/bin" \
"$ROOT/usr/share/doc/aish/examples"
# Top-level modules
for m in main broker context executor history mcp renderer repl router safety secrets; do
cp "aish/${m}.lua" "$LIBDIR/${m}.lua"
done
# FFI bindings
for m in curl libc pty readline; do
cp "aish/ffi/${m}.lua" "$LIBDIR/ffi/${m}.lua"
done
# Vendored dependencies
cp aish/vendor/dkjson.lua "$LIBDIR/vendor/dkjson.lua"
# Launch wrapper
install -m 755 aish/bin/aish "$ROOT/usr/bin/aish"
# Documentation + example config
cp aish/README.md "$ROOT/usr/share/doc/aish/"
cp aish/LICENSE "$ROOT/usr/share/doc/aish/"
cp aish/examples/config.lua "$ROOT/usr/share/doc/aish/examples/"
cp "$HERE/debian/copyright" "$ROOT/usr/share/doc/aish/copyright"
cp "$HERE/debian/changelog" "$ROOT/usr/share/doc/aish/changelog.Debian"
gzip -9 -n "$ROOT/usr/share/doc/aish/changelog.Debian"
cat > "$ROOT/DEBIAN/control" <<EOF
Package: aish
Version: ${PKGVER}-${PKGREL}
Section: shells
Priority: optional
Architecture: all
Depends: luajit, libreadline8t64 | libreadline8, libcurl4t64 | libcurl4
Maintainer: Markus Fritsche <mfritsche@reauktion.de>
Homepage: https://git.reauktion.de/marfrit/aish
Description: AI-augmented conversational shell (LuaJIT, FFI-only)
 aish is an interactive REPL that interleaves shell execution and
 language-model conversation against llama.cpp HTTP brokers. Pure
 LuaJIT 2.x with FFI bindings to libcurl, GNU readline, and libc.
 .
 Modules install under /usr/share/lua/5.1/aish/. The launcher is
 /usr/bin/aish. Example configuration is at
 /usr/share/doc/aish/examples/config.lua (copy to
 ~/.config/aish/config.lua and adapt).
EOF
# Build the .deb. Output to current dir of the caller.
DEB_OUT=aish_${PKGVER}-${PKGREL}_all.deb
dpkg-deb --root-owner-group --build "$ROOT" "$HERE/$DEB_OUT"
echo "built: $HERE/$DEB_OUT"
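The `SOURCE_DATE_EPOCH` pin in the script above can be cross-checked against the date in its comment. The sketch below assumes GNU `date` (the `-d` and `@epoch` forms are not POSIX); `epoch_from_utc` and `utc_from_epoch` are hypothetical helper names. As far as the dpkg-deb manual describes it, `dpkg-deb` uses this variable for the ar container timestamps and to clamp mtimes of the tar entries, which is what the script relies on.

```bash
#!/bin/bash
# Hypothetical helpers; both rely on GNU date.

# Convert a UTC wall-clock time to seconds since the epoch.
epoch_from_utc() { date -u -d "$1" +%s; }

# Convert seconds since the epoch back to a UTC wall-clock time.
utc_from_epoch() { date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'; }

epoch_from_utc '2026-05-25 00:00:00'   # prints 1779667200
utc_from_epoch 1779667200              # prints 2026-05-25 00:00:00
```

Running the second helper on any pinned value is a quick way to confirm that the comment beside it still matches.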
+14
@@ -0,0 +1,14 @@
aish (0.1.0-1) bookworm trixie; urgency=medium

  * Initial release packaged for marfrit overlay repo. Phases 0-10
    complete (102 closed issues): local llama.cpp + cloud broker
    routing via hossenfelder, MCP tool calls with confirm-gate and
    per-tool auto_approve, Chuck Norris autonomous mode with
    destructive-op heuristic, cross-session memory.jsonl, multi-model
    routing + GBNF grammar passthrough, project file-tree context,
    cost/usage observability, /tokenize endpoint integration, project
    overlay (.aish.lua + sha256-pinned trust ledger), cloud preplanner
    → local executor split.
  * Source-of-truth: git.reauktion.de/marfrit/aish, tagged v0.1.0.

 -- Markus Fritsche <mfritsche@reauktion.de>  Mon, 25 May 2026 00:00:00 +0000
+20
@@ -0,0 +1,20 @@
Source: aish
Section: shells
Priority: optional
Maintainer: Markus Fritsche <mfritsche@reauktion.de>
Standards-Version: 4.6.2
Homepage: https://git.reauktion.de/marfrit/aish

Package: aish
Architecture: all
Depends: ${misc:Depends}, luajit, libreadline8t64 | libreadline8, libcurl4t64 | libcurl4
Description: AI-augmented conversational shell (LuaJIT, FFI-only)
 aish is an interactive REPL that interleaves shell execution and language-
 model conversation against llama.cpp HTTP brokers. Implementation is pure
 LuaJIT 2.x with FFI bindings to libcurl, GNU readline, and libc — no C
 extensions, no build step.
 .
 Modules install under /usr/share/lua/5.1/aish/. The launcher is
 /usr/bin/aish. Example configuration is at
 /usr/share/doc/aish/examples/config.lua (copy to ~/.config/aish/config.lua
 and adapt).
+30
@@ -0,0 +1,30 @@
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: aish
Source: https://git.reauktion.de/marfrit/aish

Files: *
Copyright: 2026 Markus Fritsche <mfritsche@reauktion.de>
License: MIT
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 .
 The above copyright notice and this permission notice shall be included in
 all copies or substantial portions of the Software.
 .
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND.

Files: vendor/dkjson.lua
Copyright: 2010-2014 David Heiko Kolf
License: MIT
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including the rights to use, copy,
 modify, merge, publish, distribute, sublicense, and/or sell copies of the
 Software, and to permit persons to whom the Software is furnished to do so,
 subject to the following conditions: the above copyright notice and this
 permission notice shall be included in all copies or substantial portions of
 the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND.
+150
@@ -0,0 +1,150 @@
#!/bin/bash
# Package pre-built chromium-fourier artifacts into a .deb.
#
# Chromium can't be compiled natively on any available aarch64 runner
# (clang version wall — chromium requires its internal clang fork).
# The build is cross-compiled on CT 220 (data, x86_64 Ryzen 7).
# This script expects the build artifacts to exist at BUILD_DIR
# (default: fetched from CT 220 via SSH).
#
# Sibling Arch package: ../../arch/chromium-fourier/PKGBUILD
set -euo pipefail
PKGVER=148.0.7778.178
EPOCH=1
PKGREL=1
ARCH=arm64
HERE=$(dirname "$(readlink -f "$0")")
export SOURCE_DATE_EPOCH=1779854400 # 2026-05-27 04:00 UTC
BUILD_DIR="${BUILD_DIR:-}"
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
if [ -z "$BUILD_DIR" ]; then
echo "BUILD_DIR not set — fetching artifacts from CT 220 on data..."
BUILD_DIR="$work/artifacts"
mkdir -p "$BUILD_DIR"
ssh root@data "pct exec 220 -- tar -cf - -C /build/chromium/src/out/Default \
chrome chrome_crashpad_handler \
libEGL.so libGLESv2.so libvk_swiftshader.so libvulkan.so.1 \
vk_swiftshader_icd.json \
chrome_100_percent.pak chrome_200_percent.pak resources.pak \
v8_context_snapshot.bin snapshot_blob.bin icudtl.dat \
locales/" | tar -xf - -C "$BUILD_DIR"
fi
ROOT="$work/pkgroot"
install -Dm755 "$BUILD_DIR/chrome" "$ROOT/usr/lib/chromium/chromium"
install -Dm755 "$BUILD_DIR/chrome_crashpad_handler" "$ROOT/usr/lib/chromium/chrome_crashpad_handler"
for so in libEGL.so libGLESv2.so libvk_swiftshader.so libvulkan.so.1; do
[ -f "$BUILD_DIR/$so" ] && install -Dm755 "$BUILD_DIR/$so" "$ROOT/usr/lib/chromium/$so"
done
for icd in "$BUILD_DIR"/*_icd.json; do
[ -f "$icd" ] && install -Dm644 "$icd" "$ROOT/usr/lib/chromium/$(basename "$icd")"
done
for f in chrome_100_percent.pak chrome_200_percent.pak resources.pak \
v8_context_snapshot.bin snapshot_blob.bin icudtl.dat; do
[ -f "$BUILD_DIR/$f" ] && install -Dm644 "$BUILD_DIR/$f" "$ROOT/usr/lib/chromium/$f"
done
if [ -d "$BUILD_DIR/locales" ]; then
install -dm755 "$ROOT/usr/lib/chromium/locales"
cp -r "$BUILD_DIR/locales/"* "$ROOT/usr/lib/chromium/locales/"
fi
install -dm755 "$ROOT/usr/bin"
cat > "$ROOT/usr/bin/chromium-fourier" <<'LAUNCHER'
#!/bin/bash
USER_HANDLES_VULKAN=0
for arg in "$@"; do
case "$arg" in
--use-vulkan*|--enable-features=*Vulkan*|--disable-features=*Vulkan*|--use-angle=vulkan*)
USER_HANDLES_VULKAN=1
break
;;
esac
done
vulkan_default=()
if [ "$USER_HANDLES_VULKAN" = 0 ]; then
vulkan_default=(--disable-features=Vulkan)
fi
exec /usr/lib/chromium/chromium \
--ozone-platform=wayland \
--use-gl=angle --use-angle=gles \
--enable-features=AcceleratedVideoDecoder \
"${vulkan_default[@]}" \
"$@"
LAUNCHER
chmod 0755 "$ROOT/usr/bin/chromium-fourier"
mkdir -p "$ROOT/usr/share/doc/chromium-fourier" "$ROOT/DEBIAN"
install -Dm644 "$HERE/debian/copyright" \
"$ROOT/usr/share/doc/chromium-fourier/copyright"
install -Dm644 "$HERE/debian/changelog" \
"$ROOT/usr/share/doc/chromium-fourier/changelog.Debian"
gzip -9 -n "$ROOT/usr/share/doc/chromium-fourier/changelog.Debian"
ISIZE=$(du -sk "$ROOT" | awk '{print $1}')
cat > "$ROOT/DEBIAN/control" <<EOF
Package: chromium-fourier
Version: ${EPOCH}:${PKGVER}-${PKGREL}
Section: web
Priority: optional
Architecture: ${ARCH}
Installed-Size: ${ISIZE}
Depends: libasound2,
 libatk-bridge2.0-0,
 libatk1.0-0,
 libcairo2,
 libcups2,
 libdbus-1-3,
 libdrm2,
 libexpat1,
 libfontconfig1,
 libfreetype6,
 libgbm1,
 libglib2.0-0,
 libgtk-3-0,
 libnspr4,
 libnss3,
 libpango-1.0-0,
 libpulse0,
 libva2,
 libwayland-client0,
 libx11-6,
 libxcb1,
 libxkbcommon0,
 libpipewire-0.3-0,
 fonts-liberation,
 v4l-utils
Provides: www-browser
Conflicts: chromium
Maintainer: Markus Fritsche <mfritsche@reauktion.de>
Homepage: https://www.chromium.org/
Description: Chromium with V4L2 HW video decode for Rockchip (Wayland + mainline)
 Chromium ${PKGVER} with three patches enabling V4L2 hardware video
 decoding on mainline Linux / Wayland for Rockchip SoCs (RK3566 hantro,
 RK3588 VDPU381).
 .
 Cross-compiled from x86_64 using chromium's bundled clang (upstream
 LLVM cannot compile chromium). Runtime target is aarch64.
 .
 Patches: enable-v4l2-decoder-default, wayland-allow-direct-egl-gles2,
 nv12-external-oes-on-modifier-external-only.
 .
 Launcher at /usr/bin/chromium-fourier defaults to Wayland + ANGLE/GLES
 with Vulkan disabled (panvk on RK3566 breaks V4L2 dispatch).
EOF
DEB_OUT="chromium-fourier_${EPOCH}%3a${PKGVER}-${PKGREL}_${ARCH}.deb"
dpkg-deb --root-owner-group --build "$ROOT" "$HERE/$DEB_OUT"
echo "built: $HERE/$DEB_OUT"
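The launcher embedded in the script above only adds `--disable-features=Vulkan` when no user argument already mentions Vulkan. The sketch below restates that logic as functions so it can be exercised without starting a browser. `user_handles_vulkan` and `launcher_args` are hypothetical names introduced here; the patterns and default flags are copied from the launcher heredoc.

```bash
#!/bin/bash
# Hypothetical function form of the launcher's argument scan.
# Returns 0 if any argument already decides Vulkan, 1 otherwise.
user_handles_vulkan() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            --use-vulkan*|--enable-features=*Vulkan*|--disable-features=*Vulkan*|--use-angle=vulkan*)
                return 0
                ;;
        esac
    done
    return 1
}

# Prints, one per line, the arguments the launcher would hand to the binary.
launcher_args() {
    local args=(--ozone-platform=wayland --use-gl=angle --use-angle=gles
                --enable-features=AcceleratedVideoDecoder)
    if ! user_handles_vulkan "$@"; then
        args+=(--disable-features=Vulkan)
    fi
    printf '%s\n' "${args[@]}" "$@"
}

# Default case: the Vulkan kill-switch is appended.
launcher_args https://example.org
# User opted in explicitly: the kill-switch is left out.
launcher_args --enable-features=Vulkan https://example.org
```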
+8
@@ -0,0 +1,8 @@
chromium-fourier (1:148.0.7778.178-1) trixie; urgency=medium

  * Chromium 148.0.7778.178 with V4L2 HW decode patches for Rockchip.
  * Cross-compiled from x86_64 using chromium's bundled clang.
  * Three fourier patches: enable-v4l2-decoder-default,
    wayland-allow-direct-egl-gles2, nv12-external-oes-on-modifier-external-only.

 -- Markus Fritsche <mfritsche@reauktion.de>  Sun, 24 May 2026 09:00:00 +0200
+32
@@ -0,0 +1,32 @@
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: Chromium
Upstream-Contact: chromium-dev@chromium.org
Source: https://www.chromium.org/

Files: *
Copyright: The Chromium Authors
License: BSD-3-Clause

Files: debian/*
Copyright: 2026 Markus Fritsche <mfritsche@reauktion.de>
License: BSD-3-Clause

License: BSD-3-Clause
 Redistribution and use in source and binary forms, with or without
 modification, are permitted provided that the following conditions are met:
 .
 1. Redistributions of source code must retain the above copyright notice,
    this list of conditions and the following disclaimer.
 .
 2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the distribution.
 .
 3. Neither the name of the copyright holder nor the names of its
    contributors may be used to endorse or promote products derived from
    this software without specific prior written permission.
 .
 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 ARE DISCLAIMED.
+3 -3
@@ -14,9 +14,9 @@
# Sibling userspace package: ../daedalus-v4l2/build-deb.sh
set -euo pipefail
- UPSTREAM_COMMIT=79256dc7ef41f83873ca9c23db20f5888858e65d
- PKGVER=0.1.0+r28+g79256dc
- PKGREL=1 # reset for new upstream pin (79256dc — H.264 B-frame reorder fix); still carries the #64 multi-kernel postinst fix
+ UPSTREAM_COMMIT=872eec505eb91b561892d02a0526749348ddc121
+ PKGVER=0.1.0+r45+g872eec5
+ PKGREL=1 # reset for new upstream pin (872eec5 — PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB, closes #19); lock-step with daedalus-v4l2 0.1.0+r45+g872eec5 REQUIRED
MODULE_NAME=daedalus_v4l2
HERE=$(dirname "$(readlink -f "$0")")
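The `PKGVER` string in this hunk follows a `<base>+r<N>+g<short-hash>` shape that has to stay in sync with `UPSTREAM_COMMIT`. A minimal sketch of deriving one from the other is shown below. `make_pkgver` is a hypothetical helper, and reading `r45` as a revision count (for example from `git rev-list --count`) is an assumption; the diff does not say how the number is obtained.

```bash
#!/bin/bash
# Hypothetical helper: assemble <base>+r<rev_count>+g<first 7 hex digits of commit>.
make_pkgver() {
    local base=$1 rev_count=$2 commit=$3
    printf '%s+r%s+g%s\n' "$base" "$rev_count" "${commit:0:7}"
}

make_pkgver 0.1.0 45 872eec505eb91b561892d02a0526749348ddc121
# prints 0.1.0+r45+g872eec5
```

Generating the string this way keeps the short hash in `PKGVER` from drifting away from the pinned commit when only one of the two lines is edited.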
+57
@@ -1,3 +1,60 @@
daedalus-v4l2-dkms (0.1.0+r45+g872eec5-1) bookworm trixie; urgency=medium
* Bump to 872eec5 — picks up daedalus-v4l2 PR #20 (closes #19).
Wire-protocol cap DAEDALUS_PROTO_MAX_PAYLOAD raised from 64 KiB
to 1 MiB in include/daedalus_v4l2_proto.h. The kernel module
inherits the larger DAEDALUS_MAX_BITSTREAM via the same #define
and daedalus_fill_output_fmt now reports OUTPUT_MPLANE
sizeimage = ~1 MiB instead of 65484.
* Skips the r33 -> r45 commit range — between 5d8b436 and 872eec5
only one kernel/include change landed (the PROTO_MAX_PAYLOAD
bump above). The intervening daemon-only bumps (r37 / r39 /
r41 / r43) didn't touch kernel/ or include/ at all.
* Effective wire cap is min(kernel, daemon) — lock-step install
WITH daedalus-v4l2 0.1.0+r45+g872eec5 REQUIRED.
* Allocations (kmemdup / kmalloc on payload, vb2 plane backing)
are dynamic and sized per-payload at runtime; the bump only
sets the ceiling. KMALLOC_MAX_SIZE on aarch64 SLUB is several
MiB so 1 MiB is well within bounds.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 21:00:00 +0000
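The lock-step requirement in the entry above follows from the effective cap being the smaller of the two sides' limits. The arithmetic can be sketched as follows; `effective_cap` is a hypothetical helper, and the two byte values are simply 64 KiB and 1 MiB as quoted in the entry.

```bash
#!/bin/bash
KIB=1024
MIB=$(( 1024 * KIB ))

# Hypothetical helper: the usable payload cap is the minimum of both sides.
effective_cap() {
    local kernel_cap=$1 daemon_cap=$2
    echo $(( kernel_cap < daemon_cap ? kernel_cap : daemon_cap ))
}

# Old kernel module with new daemon: still capped at 64 KiB.
effective_cap $(( 64 * KIB )) "$MIB"    # prints 65536
# Both packages at r45: the full 1 MiB is available.
effective_cap "$MIB" "$MIB"             # prints 1048576
```

Upgrading only one of the two packages therefore leaves the old 64 KiB limit in force.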
daedalus-v4l2-dkms (0.1.0+r33+g5d8b436-1) bookworm trixie; urgency=medium
* Bump to 5d8b436 — reverts daedalus-v4l2 PRs #7 + #8. Kernel
module returns to the pre-#7 buf_done_and_job_finish completion
model: no src/dst lifecycle decoupling, no parked dst_bufs, no
1:1-contract violation against libva-v4l2-request-fourier
(closes daedalus-v4l2#9 + #10 as won't-fix at this layer; proper
fix tracked at daedalus-v4l2#11).
* Wire-protocol drops 1 → 0; lock-step install with daedalus-v4l2
0.1.0+r33+g5d8b436 REQUIRED.
* Carries forward the #64 multi-kernel postinst fix.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 14:50:00 +0000
daedalus-v4l2-dkms (0.1.0+r30+g6ffe92b-1) bookworm trixie; urgency=medium
* Bump to 6ffe92b — fixes the kernel panic regression introduced
by 79256dc's split-completion design (closes daedalus-v4l2#8).
`device_run` now removes both src + dst from `m2m_ctx`'s
rdy_queue at pickup time, not at `buf_done` time. Without
this, after `SRC_CONSUMED`'s `job_finish` released the m2m
scheduler, the NEXT `device_run` saw the still-queued parked
dst_buf and paired it with a fresh src — two inflight entries
referencing the same vb2_buffer, the later `HAS_PIXELS`
triggered list_del on an already-detached list_head, smashing
the rdy_queue → hard reboot on Pi CM5 during `mpv vaapi-copy`
playback of 720p H.264 (2026-05-21).
* Wire protocol unchanged — DAEDALUS_PROTO_VERSION stays at 1.
Daemon (userspace daedalus-v4l2 package) need NOT bump in
lockstep with this DKMS update; the existing
daedalus-v4l2 0.1.0+r28+g79256dc is wire-compatible with
daedalus-v4l2-dkms 0.1.0+r30+g6ffe92b.
* Carries forward the #64 multi-kernel postinst fix.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 14:00:00 +0000
daedalus-v4l2-dkms (0.1.0+r28+g79256dc-1) bookworm trixie; urgency=medium
* Bump to 79256dc — H.264 B-frame display reorder fix (closes
+41 -11
@@ -11,16 +11,23 @@
# Upstream repo: https://git.reauktion.de/reauktion/daedalus-v4l2
set -euo pipefail
- # Same pin as the Arch PKGBUILD. 79256dc = "kernel + daemon: H.264
- # B-frame display reorder fix (closes #6)" — adds the wire-protocol
- # src_pts / output_src_pts / RESP_FRAME flags split that lets H.264
- # streams with B-frames preserve display order through libva → kernel
- # daemon. PROTO_VERSION bumps 0 → 1; lock-step userspace + kernel
- # rebuild REQUIRED (daedalus-v4l2-dkms build-deb.sh pinned to the same
- # commit).
- UPSTREAM_COMMIT=79256dc7ef41f83873ca9c23db20f5888858e65d
- PKGVER=0.1.0+r28+g79256dc
- PKGREL=1 # reset for new upstream pin (79256dc — H.264 B-frame reorder fix)
+ # 6e6dfa1 = picks up daedalus-v4l2 PR #16 — daemon now dlopens
+ # the Kwiboo fourier fork's libavcodec.so.62 / libavformat.so.62 /
+ # libavutil.so.60 at /opt/fourier instead of Debian-stock soname
+ # 61/61/59. First step on the daedalus-fourier substitution arc
+ # (daedalus-v4l2#11): routes the daemon through the libavcodec
+ # source tree we own in marfrit-packages. Headers + .pc files
+ # come from ffmpeg-v4l2-request-fourier (installed by the CI
+ # workflow before this script runs; see PKG_CONFIG_PATH below).
+ UPSTREAM_COMMIT=872eec505eb91b561892d02a0526749348ddc121
+ PKGVER=0.1.0+r45+g872eec5
+ PKGREL=1 # reset for new upstream pin (872eec5 — PROTO_MAX_PAYLOAD 64 KiB -> 1 MiB, closes #19); lock-step with daedalus-v4l2-dkms 0.1.0+r45+g872eec5 REQUIRED
+ # daedalus-fourier pin. d87239d = marfrit/daedalus-fourier PR #1 merge
+ # (install rules + pkg-config, enables this consumer to find_package
+ # + link). Bump in lockstep with the upstream daemon when daedalus-
+ # fourier's API or installed shaders are changed by a new consumer.
+ DAEDALUS_FOURIER_COMMIT=d87239d8172307d9a1b93c95cbed116d175b85cc
HERE=$(dirname "$(readlink -f "$0")")
@@ -30,14 +37,37 @@ export SOURCE_DATE_EPOCH=1779231600
work=$(mktemp -d)
trap "rm -rf $work" EXIT
# --- daedalus-fourier: fetch + build + install to per-build prefix ---
#
# Static-linked into the daemon, so the temp prefix is only for the
# duration of this build script. Requires libvulkan-dev + glslang-tools
# on the runner (already needed for the daedalus-fourier benches).
FOURIER_PREFIX=$work/fourier-prefix
mkdir -p "$FOURIER_PREFIX"
cd "$work"
curl --connect-timeout 10 --max-time 600 --retry 3 --retry-delay 5 -sSLfo daedalus-fourier.tar.gz \
"https://git.reauktion.de/marfrit/daedalus-fourier/archive/${DAEDALUS_FOURIER_COMMIT}.tar.gz"
tar xzf daedalus-fourier.tar.gz
cd daedalus-fourier
cmake -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX="$FOURIER_PREFIX"
cmake --build build --target daedalus_core
cmake --install build
# --- daedalus-v4l2: fetch + build daemon against installed daedalus-fourier ---
cd "$work"
curl --connect-timeout 10 --max-time 600 --retry 3 --retry-delay 5 -sSLfo daedalus-v4l2.tar.gz \
"https://git.reauktion.de/reauktion/daedalus-v4l2/archive/${UPSTREAM_COMMIT}.tar.gz"
tar xzf daedalus-v4l2.tar.gz
SRCDIR=daedalus-v4l2
- # Build daemon (CMake)
+ # Build daemon (CMake) — point pkg-config at the daedalus-fourier
+ # temp prefix so pkg_check_modules(DAEDALUS_FOURIER …) resolves to it.
cd "$SRCDIR/daemon"
PKG_CONFIG_PATH="$FOURIER_PREFIX/lib/pkgconfig:/opt/fourier/lib/pkgconfig" \
cmake -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=/usr
+136
@@ -1,3 +1,139 @@
daedalus-v4l2 (0.1.0+r45+g872eec5-1) bookworm trixie; urgency=medium
* Bump to 872eec5 — picks up daedalus-v4l2 PR #20 (closes #19).
Wire-protocol cap DAEDALUS_PROTO_MAX_PAYLOAD raised from 64 KiB
to 1 MiB. DAEDALUS_MAX_BITSTREAM follows; daedalus_fill_output_fmt
now reports OUTPUT_MPLANE sizeimage = ~1 MiB instead of 65484.
libva-v4l2-request-fourier's S_FMT-driven OUTPUT-pool resize
finally succeeds; Firefox no longer falls off to libmozavcodec
SW when an H.264 slice exceeds 64 KiB (routine on any
720p+ stream).
* #define-only change in include/daedalus_v4l2_proto.h; struct
layout unchanged. But effective cap is min(kernel, daemon) —
lock-step install of this package WITH
daedalus-v4l2-dkms 0.1.0+r45+g872eec5 REQUIRED.
* Daemon-side allocations are dynamic (malloc-on-payload), so
the practical growth is one ~1 MiB read buffer per daemon
process at startup. Negligible on Pi 5 / 8 GB.
* Moves this package r43 -> r45 and brings daedalus-v4l2-dkms back
into step (it jumps r33 -> r45, having stayed at r33+g5d8b436 since
the parking-design revert because the kernel module didn't change
in r37/r39/r41/r43).
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 21:00:00 +0000
daedalus-v4l2 (0.1.0+r43+g1d8f5af-1) bookworm trixie; urgency=medium
* Bump to 1d8f5af — picks up daedalus-v4l2 PR #18 (closes #17).
Daemon now drops degenerate (<4 byte) bitstreams at the REQ_DECODE
entry instead of letting avcodec_send_packet return
AVERROR_INVALIDDATA. Reply RESP_FRAME with status=
DAEDALUS_DECODE_NO_FRAME so libva's V4L2 surface pool stays
healthy.
* Fixes the Firefox YouTube avc1 pause→resume regression observed
on higgs: libva-v4l2-request-fourier flushes a 3-byte stub
(presumably a bare NAL start code) into OUTPUT_MPLANE at the
pause boundary; the old INVALIDDATA error path made Firefox
fall off to libmozavcodec SW for the rest of the session. With
this filter the daemon logs the sentinel as 'tiny bitstream 3
bytes — dropping as no-op' and the next real REQ_DECODE
proceeds normally.
* Wire protocol unchanged. No daedalus-v4l2-dkms bump needed.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 17:30:00 +0000
daedalus-v4l2 (0.1.0+r41+g6e6dfa1-1) bookworm trixie; urgency=medium
* Bump to 6e6dfa1 — daedalus-v4l2 PR #16. Daemon dlopens Kwiboo
fourier fork's libavcodec.so.62 / libavformat.so.62 /
libavutil.so.60 at /opt/fourier instead of Debian-stock
soname 61/61/59. First step on the daedalus-fourier
substitution arc (daedalus-v4l2#11): the next PR series
layers daedalus_recipe_dispatch_h264_* substitution patches
into ffmpeg-v4l2-request-fourier's H264DSPContext NEON init,
reaching the daemon's production decode path.
* Build: PKG_CONFIG_PATH now includes /opt/fourier/lib/pkgconfig
so daemon's pkg_check_modules picks up the Kwiboo .pc files.
* CI workflow build-deps: libavcodec-dev / libavformat-dev /
libavutil-dev (Debian stock 7.1.3) → ffmpeg-v4l2-request-fourier
(provides /opt/fourier/include + .pc files).
* Wire protocol unchanged. No daedalus-v4l2-dkms bump.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 21:30:00 +0000
daedalus-v4l2 (0.1.0+r39+g3bc0da1-1) bookworm trixie; urgency=medium
* Bump to 3bc0da1 — picks up daedalus-v4l2 PR #15. Per-frame
`decoder: OK ...` log line gains `decode_us=N` (libavcodec
send_packet + receive_frame wall-clock cost in microseconds).
New `decoder stats` summary line every 60 decoded frames with
codec, fps, avg decode_us, MBs/s throughput, B/MB bitrate.
* Pure observability — no decode-path behaviour change.
Establishes baseline metrics for the substitution work in
daedalus-v4l2#11 step 2 (replacing libavcodec primitives with
daedalus-fourier kernels one cycle at a time).
* On Pi CM5 / bbb 720p H.264 baseline: ~4 ms decode_us / 24 fps
/ 90 K MBs/s — workload is well under 1 % of any single
daedalus-fourier kernel's NEON ceiling.
* Wire protocol unchanged. No daedalus-v4l2-dkms bump needed.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 18:30:00 +0000
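The "90 K MBs/s" figure in the entry above is consistent with a back-of-envelope macroblock count. The sketch below assumes a 1280x720 frame and 16x16 H.264 macroblocks with dimensions rounded up; `mbs_per_second` is a hypothetical helper and not the daemon's own stats code.

```bash
#!/bin/bash
# Hypothetical helper: macroblocks per second for a given resolution and frame rate.
# Frame dimensions are rounded up to whole 16x16 macroblocks.
mbs_per_second() {
    local width=$1 height=$2 fps=$3
    local mb_cols=$(( (width + 15) / 16 ))
    local mb_rows=$(( (height + 15) / 16 ))
    echo $(( mb_cols * mb_rows * fps ))
}

# 720p at 24 fps: 80 x 45 = 3600 macroblocks per frame.
mbs_per_second 1280 720 24    # prints 86400
```

86,400 macroblocks per second rounds to the roughly 90 K quoted in the baseline.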
daedalus-v4l2 (0.1.0+r37+g77e14e5-1) bookworm trixie; urgency=medium
* Bump to 77e14e5 — picks up daedalus-v4l2 PRs #12 + #13.
* #12 (LOW_DELAY half-measure): the daemon now sets
AV_CODEC_FLAG_LOW_DELAY on the H.264 AVCodecContext so libavcodec
emits frames in decode order ~99% of the time (a few stragglers
at GOP boundaries when the stream's SPS num_reorder_frames
overrides the flag). Visible improvement vs the 2-1-4-3
pair-swap on Firefox YouTube + mpv playback; not a permanent
fix (see #11 for the architectural plan).
* #13 (daedalus-fourier linkage): the daemon now pkg-config-links
against the daedalus-fourier kernel library (marfrit/
daedalus-fourier) and logs substrate availability at startup.
No kernels dispatched yet — this is the build-time / link-time
foundation for the H.264 daemon-rewrite plan in #11
(substituting daedalus-fourier IDCT 4×4 / IDCT 8×8 / luma
deblock primitives for libavcodec's per-MB pixel math, one
cycle at a time, measuring CPU saved per substitution).
* Build-deb.sh now fetches + builds + installs daedalus-fourier
(pinned at d87239d, marfrit/daedalus-fourier PR #1) into a
per-build temp prefix, then builds the daemon with
PKG_CONFIG_PATH pointing at it. daedalus-fourier is
statically linked into the daemon binary, so the resulting
.deb has no new runtime deps. Requires libvulkan-dev +
glslang-tools on the CI runner (the daedalus-fourier benches
already needed those).
* Wire protocol unchanged — DAEDALUS_PROTO_VERSION stays at 0.
No daedalus-v4l2-dkms bump needed.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 16:30:00 +0000
daedalus-v4l2 (0.1.0+r33+g5d8b436-1) bookworm trixie; urgency=medium
* Bump to 5d8b436 — reverts daedalus-v4l2 PRs #7 + #8 (the parking
design that broke libva-v4l2-request-fourier's 1:1 CAPTURE
contract; see daedalus-v4l2#9 + #10). After daemon-r28+g79256dc
landed, mpv (--hwdec=vaapi-copy) failed pre-playing with
"Unable to dequeue buffer: Resource temporarily unavailable" /
"Failed to end picture decode" because the daemon parked CAPTURE
buffers waiting for libavcodec to release H.264 B-frames in
display order — violating the V4L2 stateless 1:1 contract.
Firefox tolerated the mess (visible "2 1 4 3" pair-swap); mpv
bailed.
* This bump restores f0d4186-equivalent behaviour, plus PR #4
(cosmetic H.264 DECODE_MODE / START_CODE menu controls). PR #7
+ PR #8 wire-protocol additions (src_pts / output_src_pts /
RESP_FRAME flags) are reverted — DAEDALUS_PROTO_VERSION drops
back from 1 → 0. Lock-step install with daedalus-v4l2-dkms
0.1.0+r33+g5d8b436 REQUIRED.
* Visible regression: H.264 B-frame streams in Firefox revert to
the original "2 1 4 3 6 5" pair-swap visual. The proper fix
(concurrent in-flight requests in daemon + display-order reorder
in libva-v4l2-request-fourier) is tracked at daedalus-v4l2#11.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 14:50:00 +0000
daedalus-v4l2 (0.1.0+r28+g79256dc-1) bookworm trixie; urgency=medium
* Bump to 79256dc — H.264 B-frame display reorder fix (closes
@@ -0,0 +1,137 @@
From f760c0541586f43334c02611fcb4c212c08ad576 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Thu, 21 May 2026 21:40:22 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 4x4 IDCT through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.idct_add (called per 4x4 block from the intra-4x4
decode path in h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
The recipe layer picks the substrate; for cycle 6 (H.264 IDCT 4x4)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution with one extra dispatch call and recipe-table lookup.
Provides the first end-to-end exercise of the daedalus-fourier
kernel pack inside the libavcodec.so decode hot path; follow-up
patches wire IDCT 8x8, luma-v deblock, and qpel mc20.
The library context is process-global, lazily initialised under
pthread_once on first call. We pick the no-QPU constructor because
libavcodec.so is loaded into arbitrary host processes
(firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...) and we
cannot assume the host has a usable Vulkan instance. Higher cycles
(deblock luma-v, MC) that benefit from the QPU will provision their
own recipe-selected context once that path is wired.
Bulk paths (idct_add16, idct_add16intra, idct_add8 — used for
non-intra4x4 macroblocks) remain on the stock NEON .S implementations
and will be batched through daedalus_recipe_dispatch_h264_idct4 with
n_blocks>1 in a follow-up.
Bit-exact against ff_h264_idct_add_neon (daedalus-fourier cycle 6
green; see marfrit/daedalus-fourier/CYCLE_LOGS.md).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2.
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_idct_daedalus.c | 49 +++++++++++++++++++++++
libavcodec/aarch64/h264dsp_init_aarch64.c | 3 +-
3 files changed, 53 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_idct_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 41ab025..7b95fb1 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -3,7 +3,8 @@ OBJS-$(CONFIG_AC3DSP) += aarch64/ac3dsp_init_aarch64.o
OBJS-$(CONFIG_FDCTDSP) += aarch64/fdctdsp_init_aarch64.o
OBJS-$(CONFIG_FMTCONVERT) += aarch64/fmtconvert_init.o
OBJS-$(CONFIG_H264CHROMA) += aarch64/h264chroma_init_aarch64.o
-OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o
+OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
+ aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
new file mode 100644
index 0000000..538d223
--- /dev/null
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -0,0 +1,49 @@
+/*
+ * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ *
+ * Routes H264DSPContext.idct_add through
+ * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
+ * The recipe layer picks the substrate (CPU NEON by default for
+ * cycle 6; future cycles may dispatch to V3D opportunistically).
+ *
+ * FFmpeg's 4x4 block memory layout matches daedalus's column-major
+ * convention: block[r + 4*c] = coefficient at (row r, col c). Both
+ * sides destructively zero the block after the transform.
+ *
+ * The library context is process-global and lazily initialised under
+ * pthread_once. We pick the no-QPU constructor here because
+ * libavcodec.so is loaded into arbitrary host processes
+ * (firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...) and we
+ * cannot assume the host has a usable Vulkan instance. Higher cycles
+ * (deblock, MC) that benefit from the QPU initialise their own
+ * recipe-selected context once that path is wired.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+#include "libavcodec/h264dsp.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+ static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
+ block, 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index c684574..b993df2 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -66,6 +66,7 @@ void ff_biweight_h264_pixels_4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride
int weights, int offset);
void ff_h264_idct_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_add16_neon(uint8_t *dst, const int *block_offset,
int16_t *block, int stride,
@@ -139,7 +140,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
c->biweight_pixels_tab[1] = ff_biweight_h264_pixels_8_neon;
c->biweight_pixels_tab[2] = ff_biweight_h264_pixels_4_neon;
- c->idct_add = ff_h264_idct_add_neon;
+ c->idct_add = ff_h264_idct_add_daedalus;
c->idct_dc_add = ff_h264_idct_dc_add_neon;
c->idct_add16 = ff_h264_idct_add16_neon;
c->idct_add16intra = ff_h264_idct_add16intra_neon;
--
2.47.3
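
The layout claim in the patch above (coefficient at row r, column c stored at block[r + 4*c], block zeroed after the transform) can be made concrete with a scalar model. The sketch below follows the standard H.264 4x4 integer butterfly; the name idct4_add_ref is hypothetical, and this is neither the daedalus-fourier kernel nor the NEON assembly, only the kind of reference such kernels are usually compared against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clamp an intermediate sum to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Hypothetical scalar reference. Coefficient (row r, col c) lives at
 * block[r + 4*c]. Relies on arithmetic right shift of negative ints,
 * which is implementation-defined in C but universal in practice. */
static void idct4_add_ref(uint8_t *dst, int16_t *block, ptrdiff_t stride)
{
    assert(stride >= 4);
    block[0] += 1 << 5;               /* rounding bias for the final >> 6 */

    /* Pass 1: for each row index i, butterfly across the four columns. */
    for (int i = 0; i < 4; i++) {
        const int z0 =  block[i + 4*0]       +  block[i + 4*2];
        const int z1 =  block[i + 4*0]       -  block[i + 4*2];
        const int z2 = (block[i + 4*1] >> 1) -  block[i + 4*3];
        const int z3 =  block[i + 4*1]       + (block[i + 4*3] >> 1);
        block[i + 4*0] = (int16_t)(z0 + z3);
        block[i + 4*1] = (int16_t)(z1 + z2);
        block[i + 4*2] = (int16_t)(z1 - z2);
        block[i + 4*3] = (int16_t)(z0 - z3);
    }

    /* Pass 2: for each column i, butterfly down the four rows and add
     * the result into column i of dst. */
    for (int i = 0; i < 4; i++) {
        const int z0 =  block[0 + 4*i]       +  block[2 + 4*i];
        const int z1 =  block[0 + 4*i]       -  block[2 + 4*i];
        const int z2 = (block[1 + 4*i] >> 1) -  block[3 + 4*i];
        const int z3 =  block[1 + 4*i]       + (block[3 + 4*i] >> 1);
        dst[i + 0*stride] = clip_u8(dst[i + 0*stride] + ((z0 + z3) >> 6));
        dst[i + 1*stride] = clip_u8(dst[i + 1*stride] + ((z1 + z2) >> 6));
        dst[i + 2*stride] = clip_u8(dst[i + 2*stride] + ((z1 - z2) >> 6));
        dst[i + 3*stride] = clip_u8(dst[i + 3*stride] + ((z0 - z3) >> 6));
    }

    /* Destructive zero, matching the semantics the shim relies on. */
    memset(block, 0, 16 * sizeof(*block));
}
```

As a sanity check, a DC-only block with coefficient 64 raises every pixel by (64 + 32) >> 6 = 1 and leaves the block all zero.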
@@ -0,0 +1,107 @@
From 1b286ddb4efaca26ec9b9e290e989fec77dc1c77 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 10:18:21 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 8x8 IDCT through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.idct8_add (called per 8x8 block from the High-profile
intra-8x8-DCT decode path in h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct8 instead of ff_h264_idct8_add_neon.
The recipe layer picks the substrate; for cycle 7 (H.264 IDCT 8x8)
the recipe is CPU NEON, so this is effectively a NEON-to-NEON
substitution layered on top of the cycle-6 IDCT 4x4 wiring. Same
pthread_once global context, same destructive-zero semantics; FFmpeg
column-major 8x8 storage block[r + 8*c] matches daedalus's convention.
Bulk path c->idct8_add4 (used for inter 8x8-DCT macroblocks) remains
on the in-tree NEON .S code and will be batched through
daedalus_recipe_dispatch_h264_idct8 with n_blocks>1 in a follow-up.
Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 7.
---
libavcodec/aarch64/h264_idct_daedalus.c | 29 ++++++++++++++++-------
libavcodec/aarch64/h264dsp_init_aarch64.c | 3 ++-
2 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
index 538d223..cbb98af 100644
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,14 +1,16 @@
/*
- * H.264 4x4 IDCT + add — daedalus-fourier substitution shim.
+ * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
*
- * Routes H264DSPContext.idct_add through
- * daedalus_recipe_dispatch_h264_idct4 instead of ff_h264_idct_add_neon.
- * The recipe layer picks the substrate (CPU NEON by default for
- * cycle 6; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
+ * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * instead of the in-tree ff_h264_idct{,8}_add_neon assembly. The
+ * recipe layer picks the substrate (CPU NEON by default for cycles
+ * 6 + 7; future cycles may dispatch to V3D opportunistically).
*
- * FFmpeg's 4x4 block memory layout matches daedalus's column-major
- * convention: block[r + 4*c] = coefficient at (row r, col c). Both
- * sides destructively zero the block after the transform.
+ * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+ * column-major convention: block[r + N*c] = coefficient at
+ * (row r, col c) for N ∈ {4, 8}. Both sides destructively zero the
+ * block after the transform.
*
* The library context is process-global and lazily initialised under
* pthread_once. We pick the no-QPU constructor here because
@@ -37,6 +39,7 @@ static void daedalus_ctx_init_once(void)
}
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -47,3 +50,13 @@ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
daedalus_recipe_dispatch_h264_idct4(g_dctx, dst, (size_t)stride,
block, 1, &meta);
}
+
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+{
+ static const daedalus_h264_block_meta meta = { .dst_off = 0 };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+ block, 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index b993df2..741e551 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -79,6 +79,7 @@ void ff_h264_idct_add8_neon(uint8_t **dest, const int *block_offset,
const uint8_t nnzc[15 * 8]);
void ff_h264_idct8_add_neon(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_dc_add_neon(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_add4_neon(uint8_t *dst, const int *block_offset,
int16_t *block, int stride,
@@ -146,7 +147,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
c->idct_add16intra = ff_h264_idct_add16intra_neon;
if (chroma_format_idc <= 1)
c->idct_add8 = ff_h264_idct_add8_neon;
- c->idct8_add = ff_h264_idct8_add_neon;
+ c->idct8_add = ff_h264_idct8_add_daedalus;
c->idct8_dc_add = ff_h264_idct8_dc_add_neon;
c->idct8_add4 = ff_h264_idct8_add4_neon;
} else if (have_neon(cpu_flags) && bit_depth == 10) {
--
2.47.3
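
Both IDCT shims share one lazily created, process-global context. A minimal sketch of that pthread_once pattern follows, assuming a stand-in context type: demo_ctx, demo_ctx_get and the call counter are hypothetical and exist only to show that the initialiser runs once no matter how many callers reach it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the opaque library context. */
typedef struct demo_ctx { int ready; } demo_ctx;

static demo_ctx       g_storage;
static demo_ctx      *g_ctx;
static pthread_once_t g_ctx_once = PTHREAD_ONCE_INIT;
static int            g_init_calls;   /* instrumentation for this sketch only */

/* Runs exactly once per process, even with concurrent first callers. */
static void demo_ctx_init_once(void)
{
    g_storage.ready = 1;
    g_ctx = &g_storage;
    g_init_calls++;
}

/* Every hot-path entry point calls this before touching the context. */
static demo_ctx *demo_ctx_get(void)
{
    pthread_once(&g_ctx_once, demo_ctx_init_once);
    assert(g_ctx != NULL);
    return g_ctx;
}

static int demo_init_calls(void)
{
    return g_init_calls;
}
```

Because the globals are static, a second translation unit can repeat the same pattern under the same names without a link-time clash, at the cost of one context per unit.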
@@ -0,0 +1,121 @@
From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 12:15:16 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
deblock, called per macroblock-row edge from the slice deblock
loop) now dispatches through
daedalus_recipe_dispatch_h264_deblock_luma_v instead of
ff_h264_v_loop_filter_luma_neon.
The recipe layer picks the substrate; for cycle 8 the daedalus
docstring marks the kernel "CPU primary; QPU opportunistic", but
the libavcodec.so context here is built with
daedalus_ctx_create_no_qpu — process-global pthread_once init,
shared with cycles 6/7. QPU opportunism stays gated off until a
follow-up adds an explicit feature flag (no implicit Vulkan init
in arbitrary host processes). In the meantime cycle 8 is a
plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
only covers the non-intra path per its docstring.
FFmpeg `int alpha/beta/int8_t tc0[4]` → daedalus_h264_deblock_meta
(int32_t alpha/beta + inline int8_t tc0[4]). pix already points
to row 0 of the bottom block per FFmpeg's deblock convention,
satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
---
libavcodec/aarch64/h264_idct_daedalus.c | 36 +++++++++++++++++++----
libavcodec/aarch64/h264dsp_init_aarch64.c | 4 ++-
2 files changed, 33 insertions(+), 7 deletions(-)
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
index cbb98af..92365fa 100644
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,11 +1,14 @@
/*
- * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
*
- * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
- * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
- * instead of the in-tree ff_h264_idct{,8}_add_neon assembly. The
- * recipe layer picks the substrate (CPU NEON by default for cycles
- * 6 + 7; future cycles may dispatch to V3D opportunistically).
+ * Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
+ * H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+ * H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * instead of the in-tree ff_h264_*_neon assembly. The recipe layer
+ * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
+ * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
+ * so cycle 8 stays on the CPU NEON path until a separate change
+ * gates QPU init on a daedalus-fourier feature flag).
*
* FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
* column-major convention: block[r + N*c] = coefficient at
@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
block, 1, &meta);
}
+
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
index 741e551..85ac381 100644
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -27,6 +27,8 @@
void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
int cpu_flags = av_get_cpu_flags();
if (have_neon(cpu_flags) && bit_depth == 8) {
- c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_neon;
+ c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_daedalus;
c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_neon;
c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
--
2.47.3
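
The pointer convention in the patch above (pix at row 0 of the block below the edge, with rows above it also touched) is easier to see in a scalar model of the bS<4 luma filter. The sketch below follows the standard H.264 normal-strength filter equations; deblock_luma_v_ref is a hypothetical name, and the code is an illustration, not the daedalus-fourier or NEON kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

static uint8_t clip_u8(int v)
{
    return (uint8_t)clip3(v, 0, 255);
}

/* Hypothetical scalar reference for the non-intra (bS < 4) filter across a
 * horizontal edge, 16 columns wide. pix points at row 0 of the block below
 * the edge; rows -3..+2 are read and rows -2..+1 may be written. */
static void deblock_luma_v_ref(uint8_t *pix, ptrdiff_t stride,
                               int alpha, int beta, const int8_t *tc0)
{
    assert(stride >= 16);
    for (int col = 0; col < 16; col++) {
        const int tc_orig = tc0[col / 4];   /* one tc0 per 4-column group */
        uint8_t *p = pix + col;

        if (tc_orig < 0)
            continue;                        /* group is not filtered */

        const int p0 = p[-1 * stride], p1 = p[-2 * stride], p2 = p[-3 * stride];
        const int q0 = p[0], q1 = p[1 * stride], q2 = p[2 * stride];

        if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
            continue;                        /* edge looks real, leave it */

        int tc = tc_orig;
        if (abs(p2 - p0) < beta) {
            if (tc_orig)
                p[-2 * stride] = (uint8_t)(p1 + clip3(((p2 + ((p0 + q0 + 1) >> 1)) >> 1) - p1,
                                                      -tc_orig, tc_orig));
            tc++;
        }
        if (abs(q2 - q0) < beta) {
            if (tc_orig)
                p[1 * stride] = (uint8_t)(q1 + clip3(((q2 + ((p0 + q0 + 1) >> 1)) >> 1) - q1,
                                                     -tc_orig, tc_orig));
            tc++;
        }

        const int delta = clip3(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
        p[-1 * stride] = clip_u8(p0 + delta);
        p[0]           = clip_u8(q0 - delta);
    }
}
```

A horizontal variant would use the same arithmetic with the roles of the stride and the unit step swapped.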
@@ -0,0 +1,82 @@
From 0d1292ea99bc4e5fa2da438259fa01a2374e3e04 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Fri, 22 May 2026 14:18:25 +0200
Subject: [PATCH] avcodec/h264: restore AV_CODEC_FLAG_LOW_DELAY semantics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
FFmpeg 8.x dropped the H.264 decoder's low_delay path —
AV_CODEC_FLAG_LOW_DELAY no longer prevents
h264_select_output_frame from running the display-order DPB
output queue. V4L2-stateless-style consumers (daedalus-v4l2
daemon, libva-v4l2-request-fourier) that set the flag end up
seeing the 2-1-4-3 pair-swap pattern on B-frame streams again.
Restore the documented semantics:
- Early-exit at the top of h264_select_output_frame when the
flag is set: emit the just-decoded picture immediately as
next_output_pic, mirror the corruption / recovery-point
tracking the main path performs, and skip the entire
delayed_pic[] / POC reorder machinery.
- Suppress the SPS-driven has_b_frames clobber in
h264_field_start when the flag is set, so the per-slice
bitstream_restriction_flag re-pickup cannot reintroduce a
nonzero reorder buffer mid-stream.
This is a fork-only change required by the daedalus-v4l2 daemon's
one-frame-per-send_packet contract; upstream FFmpeg consumers that
expect display-order output remain untouched (flag default = off).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 deblock
+ flag-restoration follow-up.
---
libavcodec/h264_slice.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/libavcodec/h264_slice.c b/libavcodec/h264_slice.c
index 97fab70..a7bfbd6 100644
--- a/libavcodec/h264_slice.c
+++ b/libavcodec/h264_slice.c
@@ -1308,6 +1308,28 @@ static int h264_select_output_frame(H264Context *h)
cur->mmco_reset = h->mmco_reset;
h->mmco_reset = 0;
+ /* AV_CODEC_FLAG_LOW_DELAY restore (FFmpeg 8.x dropped the H.264
+ * decoder's low_delay path). Bypass the display-order DPB
+ * output queue: emit the just-decoded picture immediately, in
+ * decode order, one per send_packet. V4L2-stateless-style
+ * consumers (daedalus-v4l2 daemon, libva-v4l2-request-fourier)
+ * do their own POC-based reorder downstream and require this
+ * behaviour. */
+ if (h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) {
+ h->next_output_pic = cur;
+ h->next_outputed_poc = cur->poc;
+ h->frame_recovered |= cur->recovered;
+ cur->recovered |= h->frame_recovered & FRAME_RECOVERED_SEI;
+ if (!cur->recovered) {
+ if (!(h->avctx->flags & AV_CODEC_FLAG_OUTPUT_CORRUPT) &&
+ !(h->avctx->flags2 & AV_CODEC_FLAG2_SHOW_ALL))
+ h->next_output_pic = NULL;
+ else
+ cur->f->flags |= AV_FRAME_FLAG_CORRUPT;
+ }
+ return 0;
+ }
+
if (sps->bitstream_restriction_flag ||
h->avctx->strict_std_compliance >= FF_COMPLIANCE_STRICT) {
h->avctx->has_b_frames = FFMAX(h->avctx->has_b_frames, sps->num_reorder_frames);
@@ -1415,6 +1437,7 @@ static int h264_field_start(H264Context *h, const H264SliceContext *sl,
sps = h->ps.sps;
if (sps->bitstream_restriction_flag &&
+ !(h->avctx->flags & AV_CODEC_FLAG_LOW_DELAY) &&
h->avctx->has_b_frames < sps->num_reorder_frames) {
h->avctx->has_b_frames = sps->num_reorder_frames;
}
--
2.47.3
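
With the flag set, the decoder emits frames in decode order and the consumer is expected to restore display order itself. The sketch below is a hypothetical model of such a downstream reorder, assuming a small hold-back buffer that releases the lowest POC once it holds more than depth frames; poc_reorder and its functions are illustrative names, not code from the daemon or libva-v4l2-request-fourier, and small integers stand in for real POC values.

```c
#include <assert.h>
#include <stddef.h>

#define REORDER_MAX 16

typedef struct poc_reorder {
    int    poc[REORDER_MAX];
    size_t count;
    size_t depth;   /* frames held back before output starts */
} poc_reorder;

static void poc_reorder_init(poc_reorder *r, size_t depth)
{
    assert(depth < REORDER_MAX);
    r->count = 0;
    r->depth = depth;
}

/* Remove and return the smallest POC currently held. */
static int poc_reorder_pop_min(poc_reorder *r)
{
    assert(r->count > 0);
    size_t best = 0;
    for (size_t i = 1; i < r->count; i++)
        if (r->poc[i] < r->poc[best])
            best = i;
    const int v = r->poc[best];
    r->poc[best] = r->poc[--r->count];
    return v;
}

/* Push one decode-order frame. Returns 1 and stores a POC in *out when a
 * frame becomes ready for display, 0 while the buffer is still filling. */
static int poc_reorder_push(poc_reorder *r, int poc, int *out)
{
    assert(r->count < REORDER_MAX);
    r->poc[r->count++] = poc;
    if (r->count <= r->depth)
        return 0;
    *out = poc_reorder_pop_min(r);
    return 1;
}

/* Feed n decode-order POCs, then flush. Returns the number written to out. */
static size_t poc_reorder_run(const int *in, size_t n, size_t depth, int *out)
{
    poc_reorder r;
    size_t produced = 0;
    poc_reorder_init(&r, depth);
    for (size_t i = 0; i < n; i++)
        produced += (size_t)poc_reorder_push(&r, in[i], &out[produced]);
    while (r.count > 0)
        out[produced++] = poc_reorder_pop_min(&r);
    return produced;
}
```

With depth 0 the model passes the decode-order sequence through unchanged, which is the pair-swapped picture a consumer shows if it skips the reorder step.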
@@ -0,0 +1,139 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Sat, 23 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
daedalus-fourier
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" variant — the canonical representative of the
H.264 luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.
Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
8 luma-v deblock / 9 qpel mc20).
The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
makes any V3D shader strictly worse. Substitution is plumbing-only,
NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
context shape the cycles 6/7/8 shims already own (kept SEPARATE
from the H264DSP shim's ctx because H264QPEL is its own libavcodec
Makefile module and link order does not guarantee a single .o
owns the ctx symbol; one extra ~µs init per process, paid lazily).
Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
rationale, mc20 8x8 is representative of the whole family's per-block
cost — extending the substitution to other variants would multiply
recipe-lookup overhead without changing the substrate verdict.
Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
No SONAME change, no Depends change.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
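
The throughput and margin figures quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that every 8x8 luma block of a 1920x1080 frame is processed exactly once per frame at 30 fps; that workload model and the helper names are assumptions made for illustration, not taken from docs/k9_h264qpel_mc20.md.

```c
#include <assert.h>

/* Blocks per second a kernel sustains at ns_per_block nanoseconds each. */
static double blocks_per_second(double ns_per_block)
{
    assert(ns_per_block > 0.0);
    return 1e9 / ns_per_block;
}

/* 8x8 blocks per second needed for a width x height stream at fps,
 * assuming each block is processed once per frame. */
static double required_blocks_per_second(int width, int height, double fps)
{
    return (double)(width / 8) * (double)(height / 8) * fps;
}

/* Ratio of sustained throughput to required throughput. */
static double throughput_margin(double ns_per_block, int width, int height,
                                double fps)
{
    return blocks_per_second(ns_per_block) /
           required_blocks_per_second(width, height, fps);
}
```

Under these assumptions 7.6 ns per block is about 131.6 Mblock/s, the stream needs 240 * 135 * 30 = 972000 blocks/s, and the ratio comes out slightly above 135.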
---
libavcodec/aarch64/Makefile | 3 +-
libavcodec/aarch64/h264_qpel_daedalus.c | 50 ++++++++++++++++++++++
libavcodec/aarch64/h264qpel_init_aarch64.c | 4 +-
3 files changed, 55 insertions(+), 2 deletions(-)
create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP) += aarch64/h264dsp_init_aarch64.o \
aarch64/h264_idct_daedalus.o
OBJS-$(CONFIG_HUFFYUVDSP) += aarch64/huffyuvdsp_init_aarch64.o
OBJS-$(CONFIG_H264PRED) += aarch64/h264pred_init.o
-OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o
+OBJS-$(CONFIG_H264QPEL) += aarch64/h264qpel_init_aarch64.o \
+ aarch64/h264_qpel_daedalus.o
OBJS-$(CONFIG_HPELDSP) += aarch64/hpeldsp_init_aarch64.o
OBJS-$(CONFIG_IDCTDSP) += aarch64/idctdsp_init_aarch64.o
OBJS-$(CONFIG_ME_CMP) += aarch64/me_cmp_init_aarch64.o
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
new file mode 100644
--- /dev/null
+++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -0,0 +1,50 @@
+/*
+ * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
+ * — daedalus-fourier substitution shim.
+ *
+ * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
+ * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
+ * ff_put_h264_qpel8_mc20_neon. The recipe layer picks the substrate
+ * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
+ * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
+ *
+ * Sibling to libavcodec/aarch64/h264_idct_daedalus.c. We keep a
+ * SEPARATE process-global pthread_once context here instead of
+ * sharing the H264DSP one because H264QPEL is its own libavcodec
+ * Makefile module and link order does not guarantee a single .o
+ * owns the ctx symbol. The cost is one extra
+ * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
+ * processes pay this lazily on first MC call.
+ *
+ * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
+ * stride and `src` already points at the leftmost OUTPUT column
+ * (col 0); the 6-tap filter reads cols -2..+3. This matches
+ * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
+ * directly, so dst_off = src_off = 0.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+
+static daedalus_ctx *g_dctx;
+static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+ g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+ static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
--- a/libavcodec/aarch64/h264qpel_init_aarch64.c
+++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
- c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
+ c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
--
2.47.3
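
The source-pointer convention described in the shim's header comment (src at output column 0, each output reading columns -2..+3 around it, one stride for both buffers) can be illustrated with a scalar model of the horizontal half-pel filter. The taps (1, -5, 20, 20, -5, 1) with rounding (+16) >> 5 are the standard H.264 luma half-sample filter; put_qpel8_mc20_ref is a hypothetical name and this is a reference sketch, not the daedalus-fourier kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Hypothetical scalar reference for the 8x8 "put" horizontal half-pel case.
 * src points at output column 0; for output column x the filter reads
 * src[x-2] .. src[x+3], so each row needs 2 valid pixels to the left and
 * 3 to the right of the 8 output columns. dst and src share one stride. */
static void put_qpel8_mc20_ref(uint8_t *dst, const uint8_t *src,
                               ptrdiff_t stride)
{
    assert(stride >= 8);
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            const int sum =       src[x - 2]
                          -  5 *  src[x - 1]
                          + 20 *  src[x]
                          + 20 *  src[x + 1]
                          -  5 *  src[x + 2]
                          +       src[x + 3];
            dst[x] = clip_u8((sum + 16) >> 5);
        }
        dst += stride;
        src += stride;
    }
}
```

Since the taps sum to 32, a constant input reproduces itself, which makes a cheap first check before comparing against an optimised kernel.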
@@ -0,0 +1,92 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Mon, 25 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-h deblock through daedalus-fourier
Sibling of 0005 (which substituted v_loop_filter_luma). Same
NEON-to-NEON substitution: H264DSPContext.h_loop_filter_luma →
daedalus_recipe_dispatch_h264_deblock_luma_h. The H kernel landed
in daedalus-fourier PR #9 (CPU NEON only — no QPU shader yet).
libavcodec.so ctx is no-QPU per the existing 0003-0005 / 0007
pattern; we cannot assume Vulkan in arbitrary host processes
(firefox-fourier RDD, mpv-fourier, etc.).
Intra (bS=4) h_loop_filter_luma_intra stays on the in-tree NEON .S
code; daedalus_h264_deblock_meta only covers the non-intra path.
An intra-h substitution can land once daedalus-fourier exposes a
dispatch helper (the kernel already exists internally per PR #11).
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 H.
---
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:09:33.694760715 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:09:33.715603719 +0200
@@ -1,9 +1,10 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h deblock — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
* H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
* is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -45,6 +46,8 @@
void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t *tc0);
+void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -84,3 +87,22 @@
daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
1, &meta);
}
+
+void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_h(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:09:33.695937103 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:09:33.715541700 +0200
@@ -31,6 +31,8 @@
int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta);
void ff_h264_h_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -117,7 +119,7 @@
if (have_neon(cpu_flags) && bit_depth == 8) {
c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_daedalus;
- c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_neon;
+ c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_daedalus;
c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
--
2.47.3
@@ -0,0 +1,127 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Mon, 25 May 2026 12:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma v/h deblock through daedalus-fourier
Chroma siblings of 0005 (luma_v) and 0008 (luma_h). Same
NEON-to-NEON pattern via the daedalus recipe layer:
H264DSPContext.v_loop_filter_chroma →
daedalus_recipe_dispatch_h264_deblock_chroma_v
H264DSPContext.h_loop_filter_chroma →
daedalus_recipe_dispatch_h264_deblock_chroma_h
Both kernels landed in daedalus-fourier PR #10. Recipe table
routes AUTO to CPU NEON (no chroma QPU shaders yet), so this
is plumbing-only and stays bit-exact against the in-tree NEON.
Intra chroma (bS=4) loop filters remain on in-tree NEON;
daedalus_h264_deblock_meta covers the non-intra (bS<4) path.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 chroma.
---
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:15:45.995368233 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:15:46.015839177 +0200
@@ -1,10 +1,12 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma v/h deblock — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h + chroma v/h deblock — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
- * H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
- * H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
+ * H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
+ * H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
+ * H264DSPContext.v_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_v
+ * H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
* is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -48,6 +50,10 @@
int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
+void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -106,3 +112,41 @@
daedalus_recipe_dispatch_h264_deblock_luma_h(g_dctx, pix, (size_t)stride,
1, &meta);
}
+
+void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_chroma_v(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
+
+void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ meta.tc0[0] = tc0[0];
+ meta.tc0[1] = tc0[1];
+ meta.tc0[2] = tc0[2];
+ meta.tc0[3] = tc0[3];
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_chroma_h(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:15:45.996482360 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:15:46.025604910 +0200
@@ -39,8 +39,12 @@
int beta);
void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
+void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_chroma422_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_chroma_intra_neon(uint8_t *pix, ptrdiff_t stride,
@@ -123,11 +127,11 @@
c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
- c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_neon;
+ c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
if (chroma_format_idc <= 1) {
- c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_neon;
+ c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
} else {
--
2.47.3
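Note on the lazy-init pattern used by the wrappers above: each one calls pthread_once before dispatching, so the shared context is set up exactly once regardless of which DSP entry point runs first or on which thread. A minimal sketch of that pattern follows. The names demo_ctx, demo_ctx_init_once and demo_get_ctx are hypothetical stand-ins; the real g_dctx setup is not shown in this section.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the daedalus context; not the real type. */
typedef struct demo_ctx {
    int init_count;   /* how many times the initializer has run */
} demo_ctx;

static demo_ctx       g_demo_storage;
static demo_ctx      *g_demo_ctx  = NULL;
static pthread_once_t g_demo_once = PTHREAD_ONCE_INIT;

/* Runs at most once per process, even with concurrent callers. */
static void demo_ctx_init_once(void)
{
    g_demo_storage.init_count++;
    g_demo_ctx = &g_demo_storage;
}

/* Every wrapper would call this (or pthread_once directly) before dispatch. */
demo_ctx *demo_get_ctx(void)
{
    pthread_once(&g_demo_once, demo_ctx_init_once);
    return g_demo_ctx;
}
```

The once-control variable has to be statically initialized with PTHREAD_ONCE_INIT; after the first call the check is cheap, which matters because these wrappers sit on a per-edge hot path.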
@@ -0,0 +1,126 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Sun, 25 May 2026 12:30:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma intra deblock through daedalus-fourier
Adds the bS=4 intra-strength variants of the already-substituted
luma_v / luma_h deblock (0005, 0008). Macroblock edges with an
intra-coded MB on either side get boundary strength 4 per H.264
§8.7.2.1; edges inside an intra MB get bS=3 and stay on the bS<4 path.
H264DSPContext.v_loop_filter_luma_intra →
daedalus_recipe_dispatch_h264_deblock_luma_v_intra
H264DSPContext.h_loop_filter_luma_intra →
daedalus_recipe_dispatch_h264_deblock_luma_h_intra
Both kernels landed in daedalus-fourier PR #11. Recipe table
routes AUTO to CPU NEON (no intra QPU shaders yet) — plumbing-
only NEON-to-NEON via daedalus, bit-exact against the in-tree
FFmpeg NEON path.
Signature differs from bS<4: no tc0 argument. The wrapper
passes daedalus_h264_deblock_meta with alpha/beta set; tc0[] is
ignored by the intra dispatch (bS=4 hardcodes the strength).
Chroma intra variants are deferred to a follow-up PR because the
chroma path has a 4:2:0 / 4:2:2 split (chroma_format_idc gating)
that needs explicit conditional substitution to avoid running
the 4:2:0-only daedalus dispatch on 4:2:2 chroma.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8 intra.
---
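Note on the "tc0[] is ignored" point above: the intra wrappers set only dst_off, alpha and beta, and C zero-initializes every member omitted from a designated initializer, so tc0[] holds defined zeros rather than stack garbage. A small sketch under an assumed struct layout follows. demo_deblock_meta is hypothetical; the real daedalus_h264_deblock_meta definition is not part of this section.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout; the real daedalus_h264_deblock_meta is not shown here. */
typedef struct demo_deblock_meta {
    uint32_t dst_off;
    int      alpha;
    int      beta;
    int8_t   tc0[4];
} demo_deblock_meta;

/* bS<4 shape: copy the four per-edge tc0 values supplied by the caller. */
demo_deblock_meta demo_meta_inter(int alpha, int beta, const int8_t *tc0)
{
    demo_deblock_meta meta = { .dst_off = 0, .alpha = alpha, .beta = beta };
    for (int i = 0; i < 4; i++)
        meta.tc0[i] = tc0[i];
    return meta;
}

/* bS=4 (intra) shape: no tc0 argument; omitted members are zero-initialized. */
demo_deblock_meta demo_meta_intra(int alpha, int beta)
{
    demo_deblock_meta meta = { .dst_off = 0, .alpha = alpha, .beta = beta };
    return meta;
}
```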
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:18:54.992244965 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:20:12.338122217 +0200
@@ -1,5 +1,5 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma v/h + chroma v/h deblock — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h (inter + intra) + chroma v/h deblock — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
@@ -7,6 +7,8 @@
* H264DSPContext.h_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_h
* H264DSPContext.v_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_v
* H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
+ * H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
+ * H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
* is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -54,6 +56,10 @@
int alpha, int beta, int8_t *tc0);
void ff_h264_h_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta, int8_t *tc0);
+void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
+void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -150,3 +156,34 @@
daedalus_recipe_dispatch_h264_deblock_chroma_h(g_dctx, pix, (size_t)stride,
1, &meta);
}
+
+void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ /* tc0[] is ignored by the intra-strength dispatch (bS=4 hardcodes the strength). */
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_v_intra(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
+
+void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+
+ daedalus_recipe_dispatch_h264_deblock_luma_h_intra(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:18:54.993349573 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:20:12.338265830 +0200
@@ -35,8 +35,12 @@
int alpha, int beta, int8_t *tc0);
void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta);
+void ff_h264_v_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_h_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta);
+void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -124,8 +128,8 @@
if (have_neon(cpu_flags) && bit_depth == 8) {
c->v_loop_filter_luma = ff_h264_v_loop_filter_luma_daedalus;
c->h_loop_filter_luma = ff_h264_h_loop_filter_luma_daedalus;
- c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
- c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
+ c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_daedalus;
+ c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_daedalus;
c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
--
2.47.3
@@ -0,0 +1,101 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Sun, 25 May 2026 13:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma DC Hadamard through daedalus-fourier
Substitutes H264DSPContext.chroma_dc_dequant_idct in the
4:2:0 / bit_depth=8 init path with a wrapper that composes
the daedalus chroma DC Hadamard primitive (fourier PR #25)
with the qmul scaling that FFmpeg fuses into the same function.
Bit-exact against ff_h264_chroma_dc_dequant_idct_8_c.
Hadamard correctness gated by fourier PR #23 test suite.
4:2:2 chroma stays on the in-tree 422 variant (same
gating shape as 0009 chroma deblock substitution).
Requires daedalus-fourier commit b9f9ff2 or later (PR #25
exposing the public Hadamard symbol). Pin bumps in PKGBUILD
and build-deb.sh come in the same commit.
---
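Note on the composition described above: the split form (gather four DC values, 2x2 Hadamard, then qmul scale and scatter) can be compared against a fused form modelled on the FFmpeg C reference. The sketch below is illustrative only. demo_hadamard_2x2 is a hypothetical stand-in for the daedalus primitive, whose actual implementation is not shown here, and it assumes the Hadamard sums fit in int16_t; the fused reference keeps them in int.

```c
#include <assert.h>
#include <stdint.h>

enum { DEMO_STRIDE = 32, DEMO_XSTRIDE = 16 };

/* Hypothetical stand-in for the daedalus primitive: in-place 2x2 Hadamard
 * on four contiguous DC values laid out as [r0c0, r0c1, r1c0, r1c1]. */
static void demo_hadamard_2x2(int16_t dc[4])
{
    int a = dc[0], b = dc[1], c = dc[2], d = dc[3];
    int e = a - b;
    a = a + b;
    b = c - d;
    c = c + d;
    dc[0] = (int16_t)(a + c);
    dc[1] = (int16_t)(e + b);
    dc[2] = (int16_t)(a - c);
    dc[3] = (int16_t)(e - b);
}

/* Split form: gather, transform, scale, scatter. */
void demo_chroma_dc_split(int16_t *block, int qmul)
{
    int16_t dc[4];

    dc[0] = block[0];
    dc[1] = block[DEMO_XSTRIDE];
    dc[2] = block[DEMO_STRIDE];
    dc[3] = block[DEMO_STRIDE + DEMO_XSTRIDE];

    demo_hadamard_2x2(dc);

    block[0]                          = (int16_t)((dc[0] * qmul) >> 7);
    block[DEMO_XSTRIDE]               = (int16_t)((dc[1] * qmul) >> 7);
    block[DEMO_STRIDE]                = (int16_t)((dc[2] * qmul) >> 7);
    block[DEMO_STRIDE + DEMO_XSTRIDE] = (int16_t)((dc[3] * qmul) >> 7);
}

/* Fused form: transform and scale in one pass, intermediates kept in int. */
void demo_chroma_dc_fused(int16_t *block, int qmul)
{
    int a = block[0];
    int b = block[DEMO_XSTRIDE];
    int c = block[DEMO_STRIDE];
    int d = block[DEMO_STRIDE + DEMO_XSTRIDE];
    int e = a - b;
    a = a + b;
    b = c - d;
    c = c + d;

    block[0]                          = (int16_t)(((a + c) * qmul) >> 7);
    block[DEMO_XSTRIDE]               = (int16_t)(((e + b) * qmul) >> 7);
    block[DEMO_STRIDE]                = (int16_t)(((a - c) * qmul) >> 7);
    block[DEMO_STRIDE + DEMO_XSTRIDE] = (int16_t)(((e - b) * qmul) >> 7);
}
```

The two forms agree whenever the four Hadamard sums fit in 16 bits. Whether that bound always holds for real streams depends on the coefficient range the decoder feeds in, which this sketch does not establish.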
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:38:32.019491484 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 13:38:32.033821507 +0200
@@ -1,5 +1,5 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma v/h (inter + intra) + chroma v/h deblock — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
@@ -9,6 +9,7 @@
* H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
* H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
* H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
+ * H264DSPContext.chroma_dc_dequant_idct → daedalus_h264_chroma_dc_hadamard_2x2 + caller-side qmul
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
* is CPU primary with QPU opportunistic — the ctx below is no-QPU,
@@ -60,6 +61,7 @@
int alpha, int beta);
void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
+void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
{
@@ -187,3 +189,32 @@
daedalus_recipe_dispatch_h264_deblock_luma_h_intra(g_dctx, pix, (size_t)stride,
1, &meta);
}
+
+/* Composes daedalus_h264_chroma_dc_hadamard_2x2 with the qmul scaling
+ * that FFmpeg's reference does in one fused function (h264idct_template.c
+ * ff_h264_chroma_dc_dequant_idct).
+ *
+ * The 4 DC coefficients are scattered across the per-MB coefficient
+ * buffer at offsets [r*stride + c*xStride] (stride=32, xStride=16).
+ * Extract into a contiguous int16[4], run the Hadamard, then apply
+ * the qmul scale and write back to the original positions.
+ *
+ * No daedalus ctx needed; the Hadamard is a pure stateless primitive.
+ */
+void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul)
+{
+ enum { stride = 32, xStride = 16 };
+ int16_t dc[4];
+
+ dc[0] = block[stride*0 + xStride*0];
+ dc[1] = block[stride*0 + xStride*1];
+ dc[2] = block[stride*1 + xStride*0];
+ dc[3] = block[stride*1 + xStride*1];
+
+ daedalus_h264_chroma_dc_hadamard_2x2(dc);
+
+ block[stride*0 + xStride*0] = (int16_t)((int)dc[0] * qmul >> 7);
+ block[stride*0 + xStride*1] = (int16_t)((int)dc[1] * qmul >> 7);
+ block[stride*1 + xStride*0] = (int16_t)((int)dc[2] * qmul >> 7);
+ block[stride*1 + xStride*1] = (int16_t)((int)dc[3] * qmul >> 7);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:38:32.020346459 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 13:38:32.033909804 +0200
@@ -41,6 +41,7 @@
int beta);
void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
+void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -135,6 +136,7 @@
c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
if (chroma_format_idc <= 1) {
+ c->chroma_dc_dequant_idct = ff_h264_chroma_dc_dequant_idct_daedalus;
c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
--
2.47.3
@@ -0,0 +1,245 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Sun, 25 May 2026 14:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264qpel: route remaining qpel 8x8 positions through daedalus-fourier
Closes the H.264 qpel substitution. Extends 0007 (which routed only
mc20 put_) to ALL 15 useful positions in BOTH the put_ and avg_
tables, skipping mc00 (integer copy / pointer-only fast path).
29 substitutions total: 14 new put_ + 15 avg_. Each is a uniform
wrapper around daedalus_recipe_dispatch_h264_qpel_{avg_,}mcXY exposed
by daedalus-fourier PRs #15-#20.
All recipe-table entries route AUTO to CPU NEON (no QPU shaders
for any qpel position other than mc20 yet), so this is plumbing-only
NEON-to-NEON — bit-exact against the in-tree ff_*_h264_qpel8_*_neon
path.
16x16 qpel tables ([0][...]) stay on the in-tree NEON. daedalus
only exposes 8x8 today; 16x16 substitution can land once fourier
provides those variants (likely just dispatching the 8x8 path four
times with shifted dst/src offsets).
Refs reauktion/daedalus-v4l2#11 — substitution arc qpel buildout.
---
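Note on the 16x16 idea mentioned above: if a 16x16 variant is built from the 8x8 path, one plausible shape is four calls with dst/src shifted by 8 columns and 8 rows. The sketch below only illustrates that offset arithmetic. demo_put_8x8_copy is a plain copy standing in for any 8x8 kernel, and none of these names exist in daedalus-fourier or FFmpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*demo_qpel8_fn)(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);

/* Illustrative 8x8 kernel: a plain copy standing in for any put_ position. */
void demo_put_8x8_copy(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = src[y * stride + x];
}

/* Cover a 16x16 block with four 8x8 calls: top-left, top-right,
 * bottom-left, bottom-right. dst and src share the same stride. */
void demo_put_16x16_via_8x8(demo_qpel8_fn fn8, uint8_t *dst,
                            const uint8_t *src, ptrdiff_t stride)
{
    fn8(dst,                  src,                  stride);
    fn8(dst + 8,              src + 8,              stride);
    fn8(dst + 8 * stride,     src + 8 * stride,     stride);
    fn8(dst + 8 * stride + 8, src + 8 * stride + 8, stride);
}
```

This decomposition is valid when each output pixel depends only on source pixels around its own position, which is the case for the qpel filters; a real kernel would additionally read a few source pixels outside each 8x8 quadrant.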
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
--- a/libavcodec/aarch64/h264_qpel_daedalus.c 2026-05-25 14:05:05.789298250 +0200
+++ libavcodec/aarch64/h264_qpel_daedalus.c 2026-05-25 14:05:05.818358374 +0200
@@ -1,10 +1,13 @@
/*
- * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
- * — daedalus-fourier substitution shim.
+ * H.264 luma qpel 8x8 — daedalus-fourier substitution shims (put_ + avg_).
*
- * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
- * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
- * ff_put_h264_qpel8_mc20_neon. The recipe layer picks the substrate
+ * Routes ALL 15 useful positions in H264QpelContext's 8x8 put_ and
+ * avg_ tables through daedalus_recipe_dispatch_h264_qpel_mc{XY}
+ * (skipping mc00 which is integer copy / FFmpeg's pointer-only fast
+ * path). Plumbing-only NEON-by-recipe — daedalus-fourier PRs #15-#20
+ * exposed each variant via the same dispatch signature, so the
+ * substitution is a uniform macro across put_/avg_ and across all
+ * 15 mc positions. The recipe layer picks the substrate
* (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
* ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
*
@@ -48,3 +51,53 @@
daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
1, &meta);
}
+
+
+/* All other 8x8 qpel positions follow the same dispatch shape as mc20
+ * above. The macro collapses ~600 LOC of one-wrapper-per-variant
+ * boilerplate (29 variants total: 14 put_ + 15 avg_). */
+#define DEFINE_QPEL_WRAPPER(type, suffix, dispatch_fn) \
+void ff_ ## type ## _h264_qpel8_ ## suffix ## _daedalus(uint8_t *dst, \
+ const uint8_t *src, ptrdiff_t stride); \
+void ff_ ## type ## _h264_qpel8_ ## suffix ## _daedalus(uint8_t *dst, \
+ const uint8_t *src, ptrdiff_t stride) \
+{ \
+ static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 }; \
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once); \
+ dispatch_fn(g_dctx, dst, src, (size_t)stride, 1, &meta); \
+}
+
+/* put_ variants (mc20 stays on the explicit definition above). */
+DEFINE_QPEL_WRAPPER(put, mc10, daedalus_recipe_dispatch_h264_qpel_mc10)
+DEFINE_QPEL_WRAPPER(put, mc30, daedalus_recipe_dispatch_h264_qpel_mc30)
+DEFINE_QPEL_WRAPPER(put, mc01, daedalus_recipe_dispatch_h264_qpel_mc01)
+DEFINE_QPEL_WRAPPER(put, mc11, daedalus_recipe_dispatch_h264_qpel_mc11)
+DEFINE_QPEL_WRAPPER(put, mc21, daedalus_recipe_dispatch_h264_qpel_mc21)
+DEFINE_QPEL_WRAPPER(put, mc31, daedalus_recipe_dispatch_h264_qpel_mc31)
+DEFINE_QPEL_WRAPPER(put, mc02, daedalus_recipe_dispatch_h264_qpel_mc02)
+DEFINE_QPEL_WRAPPER(put, mc12, daedalus_recipe_dispatch_h264_qpel_mc12)
+DEFINE_QPEL_WRAPPER(put, mc22, daedalus_recipe_dispatch_h264_qpel_mc22)
+DEFINE_QPEL_WRAPPER(put, mc32, daedalus_recipe_dispatch_h264_qpel_mc32)
+DEFINE_QPEL_WRAPPER(put, mc03, daedalus_recipe_dispatch_h264_qpel_mc03)
+DEFINE_QPEL_WRAPPER(put, mc13, daedalus_recipe_dispatch_h264_qpel_mc13)
+DEFINE_QPEL_WRAPPER(put, mc23, daedalus_recipe_dispatch_h264_qpel_mc23)
+DEFINE_QPEL_WRAPPER(put, mc33, daedalus_recipe_dispatch_h264_qpel_mc33)
+
+/* avg_ variants — all 15 useful positions. */
+DEFINE_QPEL_WRAPPER(avg, mc10, daedalus_recipe_dispatch_h264_qpel_avg_mc10)
+DEFINE_QPEL_WRAPPER(avg, mc20, daedalus_recipe_dispatch_h264_qpel_avg_mc20)
+DEFINE_QPEL_WRAPPER(avg, mc30, daedalus_recipe_dispatch_h264_qpel_avg_mc30)
+DEFINE_QPEL_WRAPPER(avg, mc01, daedalus_recipe_dispatch_h264_qpel_avg_mc01)
+DEFINE_QPEL_WRAPPER(avg, mc11, daedalus_recipe_dispatch_h264_qpel_avg_mc11)
+DEFINE_QPEL_WRAPPER(avg, mc21, daedalus_recipe_dispatch_h264_qpel_avg_mc21)
+DEFINE_QPEL_WRAPPER(avg, mc31, daedalus_recipe_dispatch_h264_qpel_avg_mc31)
+DEFINE_QPEL_WRAPPER(avg, mc02, daedalus_recipe_dispatch_h264_qpel_avg_mc02)
+DEFINE_QPEL_WRAPPER(avg, mc12, daedalus_recipe_dispatch_h264_qpel_avg_mc12)
+DEFINE_QPEL_WRAPPER(avg, mc22, daedalus_recipe_dispatch_h264_qpel_avg_mc22)
+DEFINE_QPEL_WRAPPER(avg, mc32, daedalus_recipe_dispatch_h264_qpel_avg_mc32)
+DEFINE_QPEL_WRAPPER(avg, mc03, daedalus_recipe_dispatch_h264_qpel_avg_mc03)
+DEFINE_QPEL_WRAPPER(avg, mc13, daedalus_recipe_dispatch_h264_qpel_avg_mc13)
+DEFINE_QPEL_WRAPPER(avg, mc23, daedalus_recipe_dispatch_h264_qpel_avg_mc23)
+DEFINE_QPEL_WRAPPER(avg, mc33, daedalus_recipe_dispatch_h264_qpel_avg_mc33)
+
+#undef DEFINE_QPEL_WRAPPER
diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
--- a/libavcodec/aarch64/h264qpel_init_aarch64.c 2026-05-25 14:05:05.790403989 +0200
+++ libavcodec/aarch64/h264qpel_init_aarch64.c 2026-05-25 14:05:05.819136071 +0200
@@ -50,6 +50,64 @@
void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
ptrdiff_t stride);
+void ff_put_h264_qpel8_mc10_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc30_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc01_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc11_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc21_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc31_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc02_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc12_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc22_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc32_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc03_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc13_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc23_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_put_h264_qpel8_mc33_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc10_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc30_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc01_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc11_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc21_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc31_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc02_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc12_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc22_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc32_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc03_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc13_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc23_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc33_daedalus(uint8_t *dst, const uint8_t *src,
+ ptrdiff_t stride);
void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -164,21 +222,21 @@
c->put_h264_qpel_pixels_tab[0][15] = ff_put_h264_qpel16_mc33_neon;
c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
- c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
+ c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_daedalus;
c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
- c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
- c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
- c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
- c->put_h264_qpel_pixels_tab[1][ 6] = ff_put_h264_qpel8_mc21_neon;
- c->put_h264_qpel_pixels_tab[1][ 7] = ff_put_h264_qpel8_mc31_neon;
- c->put_h264_qpel_pixels_tab[1][ 8] = ff_put_h264_qpel8_mc02_neon;
- c->put_h264_qpel_pixels_tab[1][ 9] = ff_put_h264_qpel8_mc12_neon;
- c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_neon;
- c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_neon;
- c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_neon;
- c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_neon;
- c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_neon;
- c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_neon;
+ c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 6] = ff_put_h264_qpel8_mc21_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 7] = ff_put_h264_qpel8_mc31_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 8] = ff_put_h264_qpel8_mc02_daedalus;
+ c->put_h264_qpel_pixels_tab[1][ 9] = ff_put_h264_qpel8_mc12_daedalus;
+ c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_daedalus;
+ c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_daedalus;
+ c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_daedalus;
+ c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_daedalus;
+ c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_daedalus;
+ c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_daedalus;
c->avg_h264_qpel_pixels_tab[0][ 0] = ff_avg_h264_qpel16_mc00_neon;
c->avg_h264_qpel_pixels_tab[0][ 1] = ff_avg_h264_qpel16_mc10_neon;
@@ -198,21 +256,21 @@
c->avg_h264_qpel_pixels_tab[0][15] = ff_avg_h264_qpel16_mc33_neon;
c->avg_h264_qpel_pixels_tab[1][ 0] = ff_avg_h264_qpel8_mc00_neon;
- c->avg_h264_qpel_pixels_tab[1][ 1] = ff_avg_h264_qpel8_mc10_neon;
- c->avg_h264_qpel_pixels_tab[1][ 2] = ff_avg_h264_qpel8_mc20_neon;
- c->avg_h264_qpel_pixels_tab[1][ 3] = ff_avg_h264_qpel8_mc30_neon;
- c->avg_h264_qpel_pixels_tab[1][ 4] = ff_avg_h264_qpel8_mc01_neon;
- c->avg_h264_qpel_pixels_tab[1][ 5] = ff_avg_h264_qpel8_mc11_neon;
- c->avg_h264_qpel_pixels_tab[1][ 6] = ff_avg_h264_qpel8_mc21_neon;
- c->avg_h264_qpel_pixels_tab[1][ 7] = ff_avg_h264_qpel8_mc31_neon;
- c->avg_h264_qpel_pixels_tab[1][ 8] = ff_avg_h264_qpel8_mc02_neon;
- c->avg_h264_qpel_pixels_tab[1][ 9] = ff_avg_h264_qpel8_mc12_neon;
- c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_neon;
- c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_neon;
- c->avg_h264_qpel_pixels_tab[1][12] = ff_avg_h264_qpel8_mc03_neon;
- c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_neon;
- c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_neon;
- c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_neon;
+ c->avg_h264_qpel_pixels_tab[1][ 1] = ff_avg_h264_qpel8_mc10_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 2] = ff_avg_h264_qpel8_mc20_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 3] = ff_avg_h264_qpel8_mc30_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 4] = ff_avg_h264_qpel8_mc01_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 5] = ff_avg_h264_qpel8_mc11_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 6] = ff_avg_h264_qpel8_mc21_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 7] = ff_avg_h264_qpel8_mc31_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 8] = ff_avg_h264_qpel8_mc02_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][ 9] = ff_avg_h264_qpel8_mc12_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][12] = ff_avg_h264_qpel8_mc03_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_daedalus;
+ c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_daedalus;
} else if (have_neon(cpu_flags) && bit_depth == 10) {
c->put_h264_qpel_pixels_tab[0][ 1] = ff_put_h264_qpel16_mc10_neon_10;
c->put_h264_qpel_pixels_tab[0][ 2] = ff_put_h264_qpel16_mc20_neon_10;
--
2.47.3
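Note on the wrapper macro and table layout in this patch: the table assignments place mcXY at index X + 4*Y (mc10 at 1, mc01 at 4, mc33 at 15), and the wrappers are stamped out by token pasting. A reduced sketch of both ideas follows. All demo_ and DEMO_ names are hypothetical, and demo_dispatch merely records the index instead of calling a real recipe dispatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*demo_qpel_fn)(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);

/* mcXY lives at index x + 4*y of a 16-entry table. */
#define DEMO_QPEL_INDEX(x, y) ((x) + 4 * (y))

static int g_demo_last_index = -1;

/* Stand-in for a dispatch call: only records which position was requested. */
static void demo_dispatch(int index, uint8_t *dst, const uint8_t *src,
                          ptrdiff_t stride)
{
    (void)dst; (void)src; (void)stride;
    g_demo_last_index = index;
}

/* One macro invocation generates one uniformly shaped wrapper. */
#define DEMO_DEFINE_WRAPPER(type, x, y)                                       \
void demo_ ## type ## _qpel8_mc ## x ## y(uint8_t *dst, const uint8_t *src,   \
                                          ptrdiff_t stride)                   \
{                                                                             \
    demo_dispatch(DEMO_QPEL_INDEX(x, y), dst, src, stride);                   \
}

DEMO_DEFINE_WRAPPER(put, 1, 0)
DEMO_DEFINE_WRAPPER(put, 0, 1)
DEMO_DEFINE_WRAPPER(put, 3, 3)

#undef DEMO_DEFINE_WRAPPER

/* Fill the slots for the three positions generated above. */
void demo_fill_table(demo_qpel_fn tab[16])
{
    tab[DEMO_QPEL_INDEX(1, 0)] = demo_put_qpel8_mc10;
    tab[DEMO_QPEL_INDEX(0, 1)] = demo_put_qpel8_mc01;
    tab[DEMO_QPEL_INDEX(3, 3)] = demo_put_qpel8_mc33;
}

int demo_last_index(void)
{
    return g_demo_last_index;
}
```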
@@ -0,0 +1,120 @@
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: claude-noether <claude-noether@noreply.localhost>
Date: Sun, 25 May 2026 14:30:00 +0200
Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma intra deblock (4:2:0) through daedalus-fourier
Substitutes c->v_loop_filter_chroma_intra and c->h_loop_filter_chroma_intra
with daedalus wrappers in the bit_depth=8 / chroma_format_idc<=1 (4:2:0)
branch. 4:2:2 stays on the in-tree NEON path (the daedalus chroma intra
dispatch is 4:2:0-only).
The fourier dispatches were exposed in PR #11 (DEFINE_INTRA_DISPATCH
macro generates the public daedalus_dispatch_h264_deblock_chroma_*_intra
symbols + recipe wrappers).
Re-architects the chroma init: v_loop_filter_chroma_intra was previously
assigned unconditionally to the NEON variant (which works for both 4:2:0
and 4:2:2). We now assign it INSIDE both branches of the chroma_format_idc
conditional, with the 4:2:0 branch picking daedalus and the 4:2:2 branch
keeping NEON. No regression for 4:2:2 streams.
Same NEON-to-NEON via recipe shape as 0010 luma intra.
Refs reauktion/daedalus-v4l2#11 — substitution arc chroma intra.
---
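Note on the init restructuring described above: moving the v_loop_filter_chroma_intra assignment into both branches means each chroma format picks its own implementation. The sketch below mirrors that shape with a hypothetical two-entry table. The demo_ functions only write a tag byte so the selection can be observed; they are not real filters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*demo_intra_fn)(uint8_t *pix, ptrdiff_t stride, int alpha, int beta);

/* Hypothetical reduced DSP table with only the two chroma intra slots. */
typedef struct demo_dsp {
    demo_intra_fn v_chroma_intra;
    demo_intra_fn h_chroma_intra;
} demo_dsp;

/* Tagging stubs: each writes a distinct marker so the choice is visible. */
static void demo_v_intra_daedalus(uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{ (void)stride; (void)alpha; (void)beta; pix[0] = 1; }
static void demo_h_intra_daedalus(uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{ (void)stride; (void)alpha; (void)beta; pix[0] = 2; }
static void demo_v_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{ (void)stride; (void)alpha; (void)beta; pix[0] = 3; }
static void demo_h_intra_422_neon(uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{ (void)stride; (void)alpha; (void)beta; pix[0] = 4; }

/* Both slots are assigned inside each branch, so neither format inherits
 * an implementation that was meant for the other. */
void demo_init(demo_dsp *c, int chroma_format_idc)
{
    if (chroma_format_idc <= 1) {          /* 4:2:0 (and monochrome) */
        c->v_chroma_intra = demo_v_intra_daedalus;
        c->h_chroma_intra = demo_h_intra_daedalus;
    } else {                               /* 4:2:2 and above stay on NEON */
        c->v_chroma_intra = demo_v_intra_neon;
        c->h_chroma_intra = demo_h_intra_422_neon;
    }
}
```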
diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 14:21:08.267156263 +0200
+++ libavcodec/aarch64/h264_idct_daedalus.c 2026-05-25 14:21:08.287745931 +0200
@@ -1,5 +1,5 @@
/*
- * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
+ * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h (inter+intra) deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
*
* Routes H264DSPContext.idct_add → daedalus_recipe_dispatch_h264_idct4
* H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
@@ -9,6 +9,8 @@
* H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
* H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
* H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
+ * H264DSPContext.v_loop_filter_chroma_intra → daedalus_recipe_dispatch_h264_deblock_chroma_v_intra
+ * H264DSPContext.h_loop_filter_chroma_intra → daedalus_recipe_dispatch_h264_deblock_chroma_h_intra
* H264DSPContext.chroma_dc_dequant_idct → daedalus_h264_chroma_dc_hadamard_2x2 + caller-side qmul
* instead of the in-tree ff_h264_*_neon assembly. The recipe layer
* picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
@@ -61,6 +63,10 @@
int alpha, int beta);
void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
+void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
+void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
@@ -218,3 +224,30 @@
block[stride*1 + xStride*0] = (int16_t)((int)dc[2] * qmul >> 7);
block[stride*1 + xStride*1] = (int16_t)((int)dc[3] * qmul >> 7);
}
+
+void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ /* tc0[] unused for intra (bS=4 hardcodes the strength). */
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_deblock_chroma_v_intra(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
+
+void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta)
+{
+ daedalus_h264_deblock_meta meta = {
+ .dst_off = 0,
+ .alpha = alpha,
+ .beta = beta,
+ };
+ pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+ daedalus_recipe_dispatch_h264_deblock_chroma_h_intra(g_dctx, pix, (size_t)stride,
+ 1, &meta);
+}
diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
--- a/libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 14:21:08.268311057 +0200
+++ libavcodec/aarch64/h264dsp_init_aarch64.c 2026-05-25 14:21:08.287886563 +0200
@@ -42,6 +42,10 @@
void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
int alpha, int beta);
void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
+void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
+void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+ int alpha, int beta);
void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
int beta, int8_t *tc0);
void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
@@ -133,14 +137,15 @@
c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_daedalus;
c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
- c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
if (chroma_format_idc <= 1) {
c->chroma_dc_dequant_idct = ff_h264_chroma_dc_dequant_idct_daedalus;
+ c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_daedalus;
c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
- c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
+ c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_daedalus;
c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
} else {
+ c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma422_neon;
c->h_loop_filter_chroma_mbaff = ff_h264_h_loop_filter_chroma_neon;
c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma422_intra_neon;
--
2.47.3
@@ -33,7 +33,19 @@ FFMPEG_VERSION=8.1
# epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie);
# +rfourier suffix to avoid colliding with upstream/Debian rebuilds.
PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe
PKGREL=2 # pkgrel=2: Path A move to /opt/fourier prefix (2026-05-19)
PKGREL=10 # pkgrel=10: H.264 luma qpel mc20 daedalus-fourier substitution
# (cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes
# the libavcodec.so substitution sequence 6 IDCT4 / 7 IDCT8 /
# 8 luma-v deblock / 9 qpel mc20). Pulls daedalus-fourier PR #2
# which extends the public API with
# daedalus_recipe_dispatch_h264_qpel_mc20. (2026-05-23)
# daedalus-fourier pin. 209a421 = daedalus-fourier PR #2 merge — public
# API now exposes daedalus_recipe_dispatch_h264_qpel_mc20 +
# DAEDALUS_KERNEL_H264_QPEL_MC20. Cycle 9 plumbs the last H.264 NEON
# kernel through the recipe layer. Daemon-side build (debian/daedalus-v4l2)
# can bump in a follow-up; this PR only changes the libavcodec.so consumer.
DAEDALUS_FOURIER_COMMIT=b9f9ff2a89c068aea54dcb52b543afddad28311e # PR #25 — public chroma DC Hadamard
HERE=$(dirname "$(readlink -f "$0")")
@@ -57,6 +69,44 @@ fi
# Apply patches (same as Arch).
patch -Np1 -i "$HERE/0001-libudev-bypass-fallback.patch"
patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch"
patch -Np1 -i "$HERE/0003-h264-idct4-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0004-h264-idct8-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0005-h264-deblock-luma-v-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0006-h264-restore-low-delay.patch"
patch -Np1 -i "$HERE/0007-h264-qpel-mc20-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0008-h264-deblock-luma-h-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0009-h264-deblock-chroma-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0010-h264-deblock-luma-intra-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0011-h264-chroma-dc-hadamard-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0012-h264-qpel-rest-daedalus-fourier.patch"
patch -Np1 -i "$HERE/0013-h264-deblock-chroma-intra-daedalus-fourier.patch"
# --- daedalus-fourier: fetch + build static .a with PIC, install to a
# per-build prefix; libavcodec.so links it into the shared object so
# H264DSPContext.idct_add (and follow-up kernels) dispatch through the
# daedalus recipe layer instead of the in-tree NEON .S code. ---
#
# PIC is mandatory — the static .a is linked into a .so, so all object
# code must be relocatable. Vulkan is PUBLIC-linked by daedalus_core
# (queryable QPU substrate); we add libvulkan1 to Debian Depends below
# so dlopen of libavcodec.so.62 succeeds on stock trixie.
FOURIER_PREFIX=$work/fourier-prefix
mkdir -p "$FOURIER_PREFIX"
pushd "$work" >/dev/null
curl --connect-timeout 10 --max-time 600 --retry 3 --retry-delay 5 -sSLfo daedalus-fourier.tar.gz \
"https://git.reauktion.de/marfrit/daedalus-fourier/archive/${DAEDALUS_FOURIER_COMMIT}.tar.gz"
tar xzf daedalus-fourier.tar.gz
pushd daedalus-fourier >/dev/null
cmake -B build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DCMAKE_INSTALL_PREFIX="$FOURIER_PREFIX"
cmake --build build --target daedalus_core
cmake --install build
popd >/dev/null
popd >/dev/null
cd "$work/FFmpeg"
# Configure with Arch-parity flags. Drops the same set of features
# (X11, AMF, CUDA, FireWire, AviSynth, Bluray, OpenMPT, JPEG-XL,
@@ -73,6 +123,9 @@ patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch"
--mandir=/opt/fourier/share/man \
--extra-ldexeflags='-Wl,-rpath,/opt/fourier/lib' \
--extra-ldsoflags='-Wl,-rpath,/opt/fourier/lib' \
--extra-cflags="-I${FOURIER_PREFIX}/include" \
--extra-ldflags="-L${FOURIER_PREFIX}/lib" \
--extra-libs="-ldaedalus_core -lvulkan -lpthread" \
--disable-debug \
--disable-static \
--disable-doc \
@@ -95,7 +148,6 @@ patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch"
--enable-libass \
--enable-libfreetype \
--enable-libfribidi \
--enable-libxml2 \
--enable-libpulse \
--enable-libdav1d \
--enable-libopus \
@@ -147,10 +199,10 @@ Priority: optional
Architecture: arm64
Depends: libc6,
libdrm2,
libvulkan1,
libfontconfig1,
libfreetype6,
libfribidi0,
libxml2,
libpulse0,
libdav1d7 | libdav1d6,
libopus0,
+168
@@ -1,3 +1,171 @@
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-10) bookworm trixie; urgency=medium
* Add 0007-h264-qpel-mc20-daedalus-fourier.patch —
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma
horizontal half-pel, 6-tap "put" — the canonical representative
of the H.264 luma motion-compensation family) now dispatches
through daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon. Cycle 9 of the daedalus-v4l2#11
step 2 substitution arc; closes the 4-cycle libavcodec.so
substitution sequence (6 IDCT4 / 7 IDCT8 / 8 luma-v deblock /
9 qpel mc20).
* Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public
API extended with daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).
* Cycle 9 is "CPU primary; QPU pointless" per
docs/k9_h264qpel_mc20.md. Per-block 7.6 ns at 131 Mblock/s
gives 135x margin over 30 fps 1080p; QPU dispatch floor at
~250 ns makes any V3D shader strictly worse. Substitution
is plumbing-only, NEON-by-recipe — same
daedalus_ctx_create_no_qpu pthread_once shape the cycles 6/7/8
shims already own (kept SEPARATE from the H264DSP shim's ctx
because H264QPEL is its own libavcodec Makefile module and
link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).
* Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the
16x16 size tier stay on the in-tree NEON .S code. Per the
cycle-9 phase-1 rationale, mc20 8x8 is representative of the
whole family's per-block cost.
* Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks).
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Sat, 23 May 2026 12:00:00 +0000
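The 135x margin quoted in the -10 entry can be re-derived from the figures it gives (7.6 ns per block, 30 fps, 1080p). This is a rough sketch under one loud assumption that is not stated in the changelog: every 8x8 luma block of every frame costs exactly one mc20 call. The helper names are hypothetical.

```c
#include <stdint.h>

/* Number of block x block tiles needed to cover a width x height frame
 * (rounded up on both axes). */
uint64_t blocks_per_frame(uint32_t width, uint32_t height, uint32_t block)
{
    uint64_t cols = (width + block - 1) / block;
    uint64_t rows = (height + block - 1) / block;
    return cols * rows;
}

/* Ratio of the blocks per second the kernel can process to the blocks
 * per second the stream needs. ASSUMPTION: one kernel call per block
 * per frame, which is a worst-case simplification. */
double throughput_margin(double ns_per_block, uint32_t width,
                         uint32_t height, uint32_t block, double fps)
{
    double available = 1e9 / ns_per_block;   /* blocks per second */
    double required  = (double)blocks_per_frame(width, height, block) * fps;
    return available / required;
}
```

With `throughput_margin(7.6, 1920, 1080, 8, 30.0)` the frame holds 32400 blocks, the stream needs 972000 blocks per second, and the kernel sustains about 131.6 million, which gives a ratio slightly above 135.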
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-9) bookworm trixie; urgency=medium
* Add 0006-h264-restore-low-delay.patch — restore the documented
AV_CODEC_FLAG_LOW_DELAY semantics in the H.264 decoder. FFmpeg
8.x dropped the H.264 low_delay code path entirely; setting the
flag at avcodec_open2 no longer prevents the display-order DPB
output queue from running. Visible on Firefox YouTube as the
2-1-4-3 B-frame pair-swap, re-introduced silently by the
SONAME 61→62 jump in daedalus-v4l2 PR #16.
* h264_select_output_frame: early-exit when LOW_DELAY is set;
emit the just-decoded picture as next_output_pic, mirror the
corruption / recovery-point tracking, skip delayed_pic[] and
the POC reorder machinery entirely.
* h264_field_start: suppress the SPS-driven
has_b_frames = sps->num_reorder_frames clobber when LOW_DELAY
is set — without this the per-slice bitstream_restriction_flag
re-pickup would reintroduce a nonzero reorder buffer mid-
stream.
* Restores the same one-frame-per-send_packet contract the
daedalus-v4l2 daemon's decoder.c already relies on (the flag
is set unconditionally for H.264). No daemon-side change.
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 13:30:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-8) bookworm trixie; urgency=medium
* Add 0005-h264-deblock-luma-v-daedalus-fourier.patch —
H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
deblock, called per macroblock-row edge from the slice deblock
loop in libavcodec/h264_loopfilter.c) now dispatches through
daedalus_recipe_dispatch_h264_deblock_luma_v instead of
ff_h264_v_loop_filter_luma_neon. Cycle 8 of the daedalus-v4l2#11
step 2 substitution arc.
* Cycle 8 is marked "CPU primary; QPU opportunistic" in
daedalus-fourier, but the libavcodec.so context here uses
daedalus_ctx_create_no_qpu (process-global pthread_once,
shared with cycles 6/7). Opportunistic QPU is deferred to a
separate change that gates Vulkan init on a feature flag, to
avoid implicit Vulkan init in arbitrary host processes. For
now cycle 8 is plumbing-only — NEON-by-recipe.
* Intra (bS=4) loop filter c->v_loop_filter_luma_intra stays on
the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
only covers the non-intra path per its API docstring.
* Bit-exact against ff_h264_v_loop_filter_luma_neon (daedalus-fourier
cycle 8 green).
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 12:30:00 +0000
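The "process-global pthread_once" context mentioned in this entry follows a common lazy-initialization pattern. The sketch below illustrates that pattern only: `model_ctx`, `shim_get_ctx` and `shim_init_calls` are hypothetical names, and a plain `calloc` stands in for `daedalus_ctx_create_no_qpu`, whose real signature is not shown in this changelog.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the daedalus context type. */
typedef struct model_ctx {
    int qpu_enabled;   /* stays 0: models the no-QPU constructor */
} model_ctx;

static pthread_once_t g_ctx_once = PTHREAD_ONCE_INIT;
static model_ctx *g_ctx;
static int g_init_calls;

/* Runs exactly once per process, however many threads race to it. */
static void init_ctx_once(void)
{
    g_init_calls++;
    g_ctx = calloc(1, sizeof(*g_ctx));   /* stand-in for the real constructor */
}

/* Every kernel shim calls this before dispatching. The cost is paid
 * lazily on the first call; later calls only check the once flag.
 * Returns NULL if allocation failed, so a caller can fall back. */
model_ctx *shim_get_ctx(void)
{
    pthread_once(&g_ctx_once, init_ctx_once);
    return g_ctx;
}

/* Exposed only so the one-shot behaviour can be observed. */
int shim_init_calls(void)
{
    return g_init_calls;
}
```

Repeated calls to `shim_get_ctx()` return the same pointer and `shim_init_calls()` stays at 1. A second shim compiled into a different object file would own its own `pthread_once_t` and context, which matches the separate H264QPEL context described in the -10 entry.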
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-7) bookworm trixie; urgency=medium
* Add 0004-h264-idct8-daedalus-fourier.patch — H264DSPContext.idct8_add
(per-block 8x8 IDCT, called from the High-profile intra-8x8-DCT
macroblock path in libavcodec/h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct8 instead of
ff_h264_idct8_add_neon. Cycle 7 of the daedalus-v4l2#11 step 2
substitution arc — NEON-by-recipe, same pthread_once context the
cycle-6 IDCT 4x4 shim already owns.
* Bit-exact against ff_h264_idct8_add_neon (daedalus-fourier cycle 7
green; FFmpeg 8x8 block storage block[r + 8*c] matches daedalus
column-major convention).
* Bulk c->idct8_add4 (inter 8x8-DCT macroblocks) stays on the
in-tree NEON .S code; batched substitution lands later.
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Fri, 22 May 2026 10:30:00 +0000
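The `block[r + 8*c]` layout named in this entry means that consecutive memory addresses walk down a column, not along a row. A small sketch of that indexing follows; the helper names are hypothetical and `int16_t` coefficients are assumed, as is usual for 8-bit content.

```c
#include <stddef.h>
#include <stdint.h>

/* Column-major index of row r, column c in an 8x8 block, as quoted in
 * the changelog: block[r + 8*c]. */
size_t idct8_index(unsigned r, unsigned c)
{
    return (size_t)r + 8u * (size_t)c;
}

/* Copy a column-major 8x8 block into row-major order, for example to
 * compare against a reference that uses dst[8*r + c]. */
void idct8_to_row_major(const int16_t src[64], int16_t dst[64])
{
    for (unsigned r = 0; r < 8; r++)
        for (unsigned c = 0; c < 8; c++)
            dst[8u * r + c] = src[idct8_index(r, c)];
}
```

Because both sides use the same convention, no transpose of this kind is needed at the dispatch boundary; the helper is only useful when checking against row-major reference data.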
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-6) bookworm trixie; urgency=medium
* Drop --enable-libxml2 + libxml2 Depends — the Gitea
debian-aarch64 runner ships libxml2 ≥ 2.14 (SONAME 16) while
Debian trixie targets 2.12 (SONAME 2). -5 built fine, then
failed to load on higgs trixie:
dlopen(libavformat.so.62): libxml2.so.16:
cannot open shared object file
Neither the daedalus-v4l2 daemon (direct AVPacket feed —
libavformat used only for the in-tree v4l2request hwaccel
glue) nor mpv-fourier (Lua + yt-dlp + mpv's stream code do
DASH/HLS) nor firefox-fourier (gecko-media DASH demux)
consumes FFmpeg's libxml2-backed DASH demuxer, so dropping is
feature-neutral. Mirrors the libva trixie/runner ABI-skew
workaround documented in PR #62.
* CI workflow build-deps lose libxml2-dev for the same reason.
* No source code change beyond configure flags + Depends.
Substitution stays as PRs #76/#77 landed.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 23:30:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-5) bookworm trixie; urgency=medium
* pkgrel-only bump (3 → 5) to force a rebuild of the H.264 IDCT 4x4
daedalus-fourier substitution that landed in marfrit-packages PR
#76. An orphan -4 .deb already sat in the apt pool (dated
2026-05-19, no matching source commit in main); CI's
check-already-published.sh compares with `dpkg --compare-versions
pool_ver ge source_full`, which short-circuited PR #76's -3
build. Skipping past -4 lets the CI workflow actually publish the
substitution.
* No source code change beyond PKGREL and this changelog entry.
Substitution + control + build-deb.sh wiring stay as PR #76 left
them.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 21:30:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-3) bookworm trixie; urgency=medium
* Add 0003-h264-idct4-daedalus-fourier.patch — H264DSPContext.idct_add
(per-block 4x4 IDCT, called from the intra-4x4 decode path in
libavcodec/h264_mb.c) now dispatches through
daedalus_recipe_dispatch_h264_idct4 instead of
ff_h264_idct_add_neon. First end-to-end exercise of the
daedalus-fourier kernel pack inside libavcodec.so on the
production decode hot path (daedalus-v4l2#11 step 2 — cycle 6
H.264 IDCT 4x4, NEON-by-recipe).
* build-deb.sh: fetches + builds daedalus-fourier (pinned at
d87239d, lockstep with the daemon's static link) with
-fPIC into a per-build temp prefix, then passes
--extra-cflags=-I.../include --extra-ldflags=-L.../lib
--extra-libs="-ldaedalus_core -lvulkan -lpthread" to FFmpeg
configure. Static-linked into libavcodec.so.62.
* Bulk paths (idct_add16 / idct_add16intra / idct_add8) remain on
the stock NEON .S code and will be batched through
daedalus_recipe_dispatch_h264_idct4 with n_blocks>1 in a
follow-up. Cycles 7/8/9 (IDCT 8x8 / luma-v deblock / qpel mc20)
land in subsequent patches.
* Depends gains libvulkan1 — daedalus_core PUBLIC-links Vulkan
(queryable QPU substrate); the no-QPU constructor still works,
but the loader refuses libavcodec.so.62 at dlopen time without
libvulkan.so.1 present.
* No ABI change; SONAMEs stay 62/62/60.
-- Markus Fritsche <mfritsche@reauktion.de> Thu, 21 May 2026 20:00:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-1) bookworm trixie; urgency=medium
* Initial Debian packaging for the Kwiboo FFmpeg fork with V4L2