libva-v4l2-request-fourier 1.0.0.r348.7ac934e-1: CI binary segfaults on HEVC vaEndPicture; LTO/ICF suspected in build pipeline #17

Closed
opened 2026-05-15 19:00:21 +00:00 by claude-noether · 0 comments
Contributor

libva-v4l2-request-fourier: CI binary 4× smaller than hand-build + HEVC SEGFAULT — LTO suspected

libva-v4l2-request-fourier 1.0.0.r348.7ac934e-1 (published in marfrit aarch64 repo, built today on fermi LXC) segfaults on HEVC decode in vaEndPicture path. Same source 7ac934e built locally on the consumer host (fresnel, Pinebook Pro) produces a working binary. Likely aggressive optimization in the CI build pipeline that the PKGBUILD doesn't pin.

Affects the iter38 close: HEVC was certified byte-exact on May 14 with the hand-built backend; today's first attempt to consume the packaged backend reverted that result. The other 4 codecs (H.264, VP8, VP9, MPEG-2) work fine on both the CI and the hand-built binary — only HEVC trips.

Reproducer

On fresnel after pacman -S libva-v4l2-request-fourier:

LIBVA_DRIVER_NAME=v4l2_request \
ffmpeg -hide_banner -loglevel error \
    -hwaccel vaapi -hwaccel_output_format vaapi \
    -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 \
    -vf "hwdownload,format=nv12" -frames:v 20 -f rawvideo -pix_fmt nv12 /tmp/out.yuv
# → core dump, /tmp/out.yuv stays empty

Same input + same args run against mpv --hwdec=v4l2request-copy works (60 s clip in 6 s = 236 FPS) — kernel + libavcodec HEVC path is fine; only the libva backend's HEVC code path crashes.

Core dump trace

Stack trace of thread 12613:
#0  v4l2_request_drv_video.so + 0x9d74      ← crash site
#1  v4l2_request_drv_video.so + 0x5148      ← caller
#2  vaEndPicture (libva.so.2 + 0x6a1c)
#3  libavcodec.so.62 + 0x895050             ← ff_vaapi_decode_issue

Addresses unresolvable on the stripped CI binary (addr2line returns the only exported symbol __vaDriverInit_1_23 for every offset). My local debug build has different layout so symbol mapping by offset isn't reliable across builds.

Build-flag bisection

Same source 7ac934e, four binary variants tested on fresnel:

Build md5 Size HEVC ffmpeg HEVC mpv-fourier Firefox HEVC
meson setup build (default debugoptimized, no strip) 0c9a7efa… 485 KB ✓ exit 0 ✓ HW ENGAGED
arch-meson build --buildtype=release (mirrors PKGBUILD) 06650dc6… 145 KB ✓ exit 0 ✓ HW ENGAGED
arch-meson build --buildtype=release + CFLAGS=-flto 7b4387f1… 76 KB SEGV ✗ silent fail
CI build (marfrit-packages PKGBUILD via makepkg on fermi) ae611d80… 133 KB SEGV ✗ silent fail

Pattern:

  • LTO-forced local build (76 KB) crashes the same way as the CI build.
  • The CI binary's 133 KB size sits between 76 (LTO) and 145 (no LTO) — i.e. the CI is applying some form of aggressive optimisation that the in-PKGBUILD arch-meson --buildtype=release should not be doing alone. Something extra is being set at makepkg time.

makepkg.conf on both fermi (CI builder) and fresnel (consumer / local builder) reads OPTIONS=(strip docs !libtool !staticlibs emptydirs zipman purge !debug !lto) — i.e. LTO is explicitly opted out in both. So the source of LTO (or LTO-shaped behaviour) producing the CI binary is somewhere outside the visible config.

Candidates to chase:

  1. arch-meson's wrapper unconditionally sets -Db_pie=true and --buildtype=plain (then PKGBUILD reapplies --buildtype=release). The interaction with b_pie on aarch64 may pull in -fipa-icf / -flto-partition=one / cross-function ICF that has the same dead-code-eliminating effect as LTO under arch-meson defaults. Confirm what flags the actual gcc invocation got.

  2. fermi's makepkg may be picking up a user-level override file (~/.makepkg.conf, pacman-key's package signing wrapper, or a buildah/podman environment file from the CI workflow bin/marfrit-publish invocation). The repository-level workflow file (marfrit-packages/.gitea/workflows/build.yml) may inject env vars that aren't in /etc/makepkg.conf.

  3. ABI mismatch: linux-api-headers 6.19-1, libva 2.23.0-1, libdrm 2.4.133-1 are byte-identical on fermi vs fresnel (I sha256'd /usr/include/linux/v4l2-controls.h on both — same), so the V4L2 stateless HEVC control structs are layout-identical. Header skew is not the cause.

  4. Compiler version skew: not yet measured. Worth recording gcc --version on both hosts in the CI log.

Why HEVC specifically (and not the other 4)

HEVC is the only one of the five campaign codecs that submits a chain of multiple control structs per frame via vaRenderPicture (SPS + PPS + DECODE_PARAMS + SLICE_PARAMS), and the iter25/iter31 fixes hinge on which struct a given field lives in (the st_rps_bits slice-vs-decode-params routing). If a function that copies one of these structs's payload (e.g. the per-codec hevc_set_controls helper, or media_request_queue calling into a helper that's been ICF'd against an MPEG-2 lookalike) is wrongly merged or removed by aggressive optimisation, HEVC blows up while the simpler-control-chain codecs survive.

The crash offset +0x9d74 (in the source-and-build-flag-resolution my local symbol map suggests) lands within ~40 bytes of v4l2_timeval_to_ns (84-byte helper). If v4l2_timeval_to_ns got ICF-merged with a similarly-shaped helper from the HEVC path, the wrong instance's local stack layout would be invoked and a structured-fill loop in HEVC would overshoot — that's consistent with a crash after vaEndPicture finishes the queue submission, in a per-codec helper.

Workaround on consumer hosts (until the build is fixed)

git clone https://git.reauktion.de/marfrit/libva-v4l2-request-fourier
cd libva-v4l2-request-fourier
meson setup build && ninja -C build
sudo install -m644 build/src/v4l2_request_drv_video.so /usr/lib/dri/

pacman will keep reporting the file as owned by the package, but pacman -Syu will silently overwrite it back to the broken CI build on next package upgrade.

Suggested fix path

  1. Pin meson flags explicitly in PKGBUILD: add options=('!strip' '!lto') and verify that what gets shipped is the 145 KB binary (matches my local arch-meson reproduction).
  2. Or add meson_options overrides: -Db_lto=false -Db_ndebug=false and skip the strip step in package().
  3. As a smoke test in the workflow: after makepkg produces the .pkg.tar.zst, run pacman -U <pkg> in a chroot, exercise an HEVC decode via ffmpeg -hwaccel vaapi, fail the workflow if the exit code is non-zero. Cheap regression check; catches this class of bug pre-publish.

Severity

The campaign-close criterion (libva == kdirect byte-exact on all 5 codecs) holds only on the hand-built backend. Anyone installing from the marfrit repo today gets a regression. Two consumers definitely affected: ffmpeg -hwaccel vaapi HEVC path, firefox-fourier HEVC autoplay (HW path silently fails after the codec is selected — content stays SW-decoded). mpv --hwdec=v4l2request-copy is unaffected because it bypasses libva entirely.

Refs

  • Source commit: marfrit/libva-v4l2-request-fourier @ 7ac934e (iter38b: bounds check uses MAX_PROFILES)
  • iter38 campaign close (HEVC certified byte-exact, hand-built): marfrit/fresnel-fourier @ e66c5c0
  • Per-codec sensitivity context: campaign memory feedback_va_st_rps_bits_is_slice_field (the iter31 fix)
  • Two reproducible core dumps on fresnel:
    • /var/lib/systemd/coredump/core.ffmpeg.1000.ce5a0dca7def454eaafa3e2383970a76.12611.1778870067000000.zst
    • /var/lib/systemd/coredump/core.ffmpeg.1000.ce5a0dca7def454eaafa3e2383970a76.12635.1778870069000000.zst
# libva-v4l2-request-fourier: CI binary 4× smaller than hand-build + HEVC SEGFAULT — LTO suspected `libva-v4l2-request-fourier 1.0.0.r348.7ac934e-1` (published in `marfrit` aarch64 repo, built today on `fermi` LXC) **segfaults on HEVC decode in `vaEndPicture` path**. Same source `7ac934e` built locally on the consumer host (fresnel, Pinebook Pro) produces a working binary. Likely aggressive optimization in the CI build pipeline that the PKGBUILD doesn't pin. Affects the iter38 close: HEVC was certified byte-exact on May 14 with the hand-built backend; today's first attempt to consume the **packaged** backend reverted that result. The other 4 codecs (H.264, VP8, VP9, MPEG-2) work fine on both the CI and the hand-built binary — only HEVC trips. ## Reproducer On fresnel after `pacman -S libva-v4l2-request-fourier`: ```sh LIBVA_DRIVER_NAME=v4l2_request \ ffmpeg -hide_banner -loglevel error \ -hwaccel vaapi -hwaccel_output_format vaapi \ -i ~/measurements/encoded/bbb_60s_720p.hevc.mp4 \ -vf "hwdownload,format=nv12" -frames:v 20 -f rawvideo -pix_fmt nv12 /tmp/out.yuv # → core dump, /tmp/out.yuv stays empty ``` Same input + same args run against `mpv --hwdec=v4l2request-copy` works (60 s clip in 6 s = 236 FPS) — kernel + libavcodec HEVC path is fine; only the libva backend's HEVC code path crashes. ## Core dump trace ``` Stack trace of thread 12613: #0 v4l2_request_drv_video.so + 0x9d74 ← crash site #1 v4l2_request_drv_video.so + 0x5148 ← caller #2 vaEndPicture (libva.so.2 + 0x6a1c) #3 libavcodec.so.62 + 0x895050 ← ff_vaapi_decode_issue ``` Addresses unresolvable on the stripped CI binary (`addr2line` returns the only exported symbol `__vaDriverInit_1_23` for every offset). My local debug build has different layout so symbol mapping by offset isn't reliable across builds. ## Build-flag bisection Same source `7ac934e`, four binary variants tested on fresnel: | Build | md5 | Size | HEVC ffmpeg | HEVC mpv-fourier | Firefox HEVC | |------------------------------------------------------------|-------------|------------|-------------|------------------|--------------| | `meson setup build` (default `debugoptimized`, no strip) | `0c9a7efa…` | **485 KB** | ✓ exit 0 | ✓ | ✓ HW ENGAGED | | `arch-meson build --buildtype=release` (mirrors PKGBUILD) | `06650dc6…` | 145 KB | ✓ exit 0 | ✓ | ✓ HW ENGAGED | | `arch-meson build --buildtype=release` + `CFLAGS=-flto` | `7b4387f1…` | 76 KB | **SEGV** | ✓ | ✗ silent fail | | **CI build (`marfrit-packages` PKGBUILD via makepkg on fermi)** | `ae611d80…` | **133 KB** | **SEGV** | ✓ | ✗ silent fail | Pattern: - LTO-forced local build (76 KB) crashes the same way as the CI build. - The CI binary's 133 KB size sits between 76 (LTO) and 145 (no LTO) — i.e. the CI is applying *some* form of aggressive optimisation that the in-PKGBUILD `arch-meson --buildtype=release` should not be doing alone. Something extra is being set at makepkg time. makepkg.conf on **both** fermi (CI builder) and fresnel (consumer / local builder) reads `OPTIONS=(strip docs !libtool !staticlibs emptydirs zipman purge !debug !lto)` — i.e. LTO is explicitly **opted out** in both. So the source of LTO (or LTO-shaped behaviour) producing the CI binary is somewhere outside the visible config. Candidates to chase: 1. `arch-meson`'s wrapper unconditionally sets `-Db_pie=true` and `--buildtype=plain` (then PKGBUILD reapplies `--buildtype=release`). The interaction with `b_pie` on aarch64 may pull in `-fipa-icf` / `-flto-partition=one` / cross-function ICF that has the same dead-code-eliminating effect as LTO under arch-meson defaults. Confirm what flags the actual `gcc` invocation got. 2. fermi's makepkg may be picking up a *user-level* override file (`~/.makepkg.conf`, `pacman-key`'s package signing wrapper, or a buildah/podman environment file from the CI workflow `bin/marfrit-publish` invocation). The repository-level workflow file (`marfrit-packages/.gitea/workflows/build.yml`) may inject env vars that aren't in `/etc/makepkg.conf`. 3. ABI mismatch: `linux-api-headers 6.19-1`, `libva 2.23.0-1`, `libdrm 2.4.133-1` are byte-identical on fermi vs fresnel (I sha256'd `/usr/include/linux/v4l2-controls.h` on both — same), so the V4L2 stateless HEVC control structs are layout-identical. Header skew is **not** the cause. 4. Compiler version skew: not yet measured. Worth recording `gcc --version` on both hosts in the CI log. ## Why HEVC specifically (and not the other 4) HEVC is the only one of the five campaign codecs that submits a chain of multiple control structs per frame via vaRenderPicture (SPS + PPS + DECODE_PARAMS + SLICE_PARAMS), and the iter25/iter31 fixes hinge on which struct a given field lives in (the `st_rps_bits` slice-vs-decode-params routing). If a function that copies one of these structs's payload (e.g. the per-codec `hevc_set_controls` helper, or `media_request_queue` calling into a helper that's been ICF'd against an MPEG-2 lookalike) is wrongly merged or removed by aggressive optimisation, HEVC blows up while the simpler-control-chain codecs survive. The crash offset `+0x9d74` (in the source-and-build-flag-resolution my local symbol map suggests) lands within ~40 bytes of `v4l2_timeval_to_ns` (84-byte helper). If `v4l2_timeval_to_ns` got ICF-merged with a similarly-shaped helper from the HEVC path, the wrong instance's local stack layout would be invoked and a structured-fill loop in HEVC would overshoot — that's consistent with a crash *after* `vaEndPicture` finishes the queue submission, in a per-codec helper. ## Workaround on consumer hosts (until the build is fixed) ```sh git clone https://git.reauktion.de/marfrit/libva-v4l2-request-fourier cd libva-v4l2-request-fourier meson setup build && ninja -C build sudo install -m644 build/src/v4l2_request_drv_video.so /usr/lib/dri/ ``` pacman will keep reporting the file as owned by the package, but `pacman -Syu` will silently overwrite it back to the broken CI build on next package upgrade. ## Suggested fix path 1. **Pin meson flags explicitly in PKGBUILD**: add `options=('!strip' '!lto')` and verify that what gets shipped is the 145 KB binary (matches my local arch-meson reproduction). 2. **Or add `meson_options` overrides**: `-Db_lto=false -Db_ndebug=false` and skip the strip step in `package()`. 3. **As a smoke test in the workflow**: after `makepkg` produces the .pkg.tar.zst, run `pacman -U <pkg>` in a chroot, exercise an HEVC decode via `ffmpeg -hwaccel vaapi`, fail the workflow if the exit code is non-zero. Cheap regression check; catches this class of bug pre-publish. ## Severity The campaign-close criterion (libva == kdirect byte-exact on all 5 codecs) holds **only on the hand-built backend**. Anyone installing from the marfrit repo today gets a regression. Two consumers definitely affected: `ffmpeg -hwaccel vaapi` HEVC path, `firefox-fourier` HEVC autoplay (HW path silently fails after the codec is selected — content stays SW-decoded). `mpv --hwdec=v4l2request-copy` is unaffected because it bypasses libva entirely. ## Refs - Source commit: `marfrit/libva-v4l2-request-fourier @ 7ac934e` (iter38b: bounds check uses MAX_PROFILES) - iter38 campaign close (HEVC certified byte-exact, hand-built): `marfrit/fresnel-fourier @ e66c5c0` - Per-codec sensitivity context: campaign memory `feedback_va_st_rps_bits_is_slice_field` (the iter31 fix) - Two reproducible core dumps on fresnel: - `/var/lib/systemd/coredump/core.ffmpeg.1000.ce5a0dca7def454eaafa3e2383970a76.12611.1778870067000000.zst` - `/var/lib/systemd/coredump/core.ffmpeg.1000.ce5a0dca7def454eaafa3e2383970a76.12635.1778870069000000.zst`
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marfrit/marfrit-packages#17