iter39: extend auto-probe to a 3rd fd for RK3588 rockchip,rk3588-av1-vpu-dec #2

Closed
opened 2026-05-16 07:27:04 +00:00 by claude-noether · 3 comments
Collaborator

Motivation

On ampere (RK3588), the V4L2 decoder topology is three independent decoder cores, not two:

/dev/media0  rkvdec                                  HEVC + H.264
/dev/media1  hantro-vpu  rockchip,rk3568-vpu-dec     MPEG-2 + VP8 + H.264
/dev/media3  hantro-vpu  rockchip,rk3588-av1-vpu-dec AV1  ← new!

(/dev/media2 is rockchip,rk3588-vepu121-enc — encoder, out of scope.)

The libva-v4l2-request-fourier auto-probe introduced in iter38 (request.c::request_data carrying video_fd_rkvdec + media_fd_rkvdec + video_fd_hantro + media_fd_hantro) is hard-capped at 2 fds: one rkvdec + one hantro instance. The RK3588 av1-vpu-dec is also a hantro-driven device but it's a different hantro instance from the generic rk3568-vpu-dec one (different DT compatible, different /dev/video* node, different supported pixfmts). The iter38 probe picks the first hantro it finds and skips the rest — /dev/video4 (AV1) is therefore never opened.

Result: vainfo does not enumerate VAProfileAV1 on ampere even though the kernel exposes AV1F cleanly via v4l2-ctl -d /dev/video4 --list-formats-out.

Proposed iter39

Generalize struct request_data from per-kind fixed slots to a table or array of (fd, media_fd, driver_kind, codec_set) tuples. Three sub-changes:

  1. Storage: replace video_fd_rkvdec + …_hantro (and helpers request_device_kind_for_profile() returning 'r' / 'h') with struct decoder_slot decoders[MAX_DECODERS]; where each slot tracks fd, media-fd, driver-kind (rkvdec / hantro-rk3568 / hantro-rk3588-av1 / …), and the V4L2 OUTPUT pixfmts it advertises.
  2. Probe: at VA_DRIVER_INIT, walk /dev/media* (or /dev/video* — pick the simpler enumeration), for each media node read driver name + card type via MEDIA_IOC_DEVICE_INFO, classify into a known driver-kind via a small table, and add a slot. Cap at a sensible MAX_DECODERS (3 or 4 for the current fleet).
  3. Dispatch: RequestCreateConfig per-profile routing becomes a table lookup: profile → required driver-kind → find a matching slot → switch active driver_data->video_fd / media_fd. Or keep the slot-array indexed access alive past CreateConfig if the decode loop ever needs to retarget mid-frame (it doesn't today, but the API allows).
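The three sub-changes can be sketched together as below. This is a minimal illustration of the proposal, not the backend's code: every name in it (`decoder_table`, `classify_decoder`, `add_slot`, `slot_for_codec`, the `codec` enum standing in for VAProfile values) is hypothetical, the identifying string is assumed to contain the DT compatible shown in the topology table, and routing H.264 to rkvdec rather than hantro is an arbitrary choice made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout: a sketch of the proposal, not backend code. */
#define MAX_DECODERS 4

enum driver_kind {
    KIND_UNKNOWN = 0,
    KIND_RKVDEC,
    KIND_HANTRO_RK3568,
    KIND_HANTRO_RK3588_AV1,
};

/* Stand-in for the VAProfile values the real dispatch would route on. */
enum codec { CODEC_MPEG2, CODEC_H264, CODEC_HEVC, CODEC_VP8, CODEC_AV1 };

/* 1. Storage: one slot per decoder instead of one named fd pair per kind. */
struct decoder_slot {
    int video_fd;
    int media_fd;
    enum driver_kind kind;
};

struct decoder_table {
    struct decoder_slot slots[MAX_DECODERS];
    size_t count;
};

/* 2. Probe: classify an identifying string via a small table. */
static const struct {
    const char *needle;
    enum driver_kind kind;
} kind_table[] = {
    { "rk3588-av1-vpu-dec", KIND_HANTRO_RK3588_AV1 },
    { "rk3568-vpu-dec",     KIND_HANTRO_RK3568 },
    { "rkvdec",             KIND_RKVDEC },
};

enum driver_kind classify_decoder(const char *ident)
{
    for (size_t i = 0; i < sizeof(kind_table) / sizeof(kind_table[0]); i++)
        if (strstr(ident, kind_table[i].needle) != NULL)
            return kind_table[i].kind;
    return KIND_UNKNOWN; /* encoders and unrecognized devices land here */
}

/* Adds a slot; refuses unknown kinds and respects the MAX_DECODERS cap. */
int add_slot(struct decoder_table *t, int video_fd, int media_fd,
             enum driver_kind kind)
{
    if (kind == KIND_UNKNOWN || t->count >= MAX_DECODERS)
        return -1;
    t->slots[t->count++] = (struct decoder_slot){ video_fd, media_fd, kind };
    return 0;
}

/* 3. Dispatch: codec -> required driver kind -> first matching slot. */
static enum driver_kind kind_for_codec(enum codec c)
{
    switch (c) {
    case CODEC_HEVC:
    case CODEC_H264:  return KIND_RKVDEC; /* assumption: prefer rkvdec */
    case CODEC_MPEG2:
    case CODEC_VP8:   return KIND_HANTRO_RK3568;
    case CODEC_AV1:   return KIND_HANTRO_RK3588_AV1;
    }
    return KIND_UNKNOWN;
}

const struct decoder_slot *slot_for_codec(const struct decoder_table *t,
                                          enum codec c)
{
    enum driver_kind want = kind_for_codec(c);
    for (size_t i = 0; i < t->count; i++)
        if (t->slots[i].kind == want)
            return &t->slots[i];
    return NULL; /* no decoder of the required kind was probed */
}
```

With this shape, adding a future decoder kind would mean one new enum value plus one row in each table, rather than a new fd pair threaded through probe and dispatch.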

RequestQueryConfigProfiles already iterates fds + unions per iter38b's any_fd_supports_output_format() helper; just extend that helper to iterate the slot array.
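A slot-array version of that helper might look like the sketch below. It assumes each slot caches the OUTPUT pixfmts it advertised at probe time (as the storage item suggests) instead of re-issuing an enumeration ioctl per query; the name `any_slot_supports_output_format`, the struct layout, and the `FOURCC` macro are illustrative, not taken from the backend.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OUTPUT_FMTS 8

/* Same construction as the kernel's v4l2_fourcc(), redefined locally so the
 * sketch does not depend on <linux/videodev2.h>. */
#define FOURCC(a, b, c, d)                                   \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) |                  \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* Only the fields relevant to format queries are shown (hypothetical). */
struct decoder_slot {
    int video_fd;
    uint32_t output_fmts[MAX_OUTPUT_FMTS]; /* cached at probe time */
    size_t n_output_fmts;
};

/* True if any probed decoder advertises the given OUTPUT pixel format. */
bool any_slot_supports_output_format(const struct decoder_slot *slots,
                                     size_t n_slots, uint32_t pixfmt)
{
    for (size_t i = 0; i < n_slots; i++)
        for (size_t j = 0; j < slots[i].n_output_fmts; j++)
            if (slots[i].output_fmts[j] == pixfmt)
                return true;
    return false;
}
```

Because the helper unions over whatever slots exist, profile enumeration would pick up a newly probed decoder without any per-kind code.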

Boundary

This is pure userspace, backend-only. No kernel changes. The kernel already exposes the AV1 decoder; the backend just needs to look for it.

The same generalization helps any future RK35xx variant that grows additional decoder cores (e.g. a hypothetical rk3588-vp9-dec if VP9 lands as its own dedicated block instead of inside rkvdec).

Acceptance

vainfo on ampere lists VAProfileAV1. ampere-fourier iter4 (planned) validates AV1 decode end-to-end via the iter39 backend through /dev/video4 against an AV1-encoded test clip (clip provenance TBD — BBB isn't AV1 by default).

Out of scope for iter39

  • Adding a 4th fd for any of the cores reported as "missing multi-core support, ignoring this instance". Those nodes don't even register, so they're not user-visible. Multi-core glue is upstream-kernel work.
  • AV1 source clip — sourcing or re-encoding test material is operator decision, not iter39 scope.
  • AV1 byte-exactness criteria (Phase 4 C3/C4 equivalent) — iter4's Phase 1 picks those.

Refs

  • ampere-fourier iter1 Phase 0: ~/src/ampere-fourier/phase0_findings.md (decoder topology evidence)
  • Memory feedback_multi_device_probe_design (the iter38 architecture that iter39 generalizes)
  • Backend source pin: marfrit/libva-v4l2-request-fourier @ 7ac934e
Author
Collaborator

Triage refresh 2026-05-18. Still valid. Acceptance criterion (vainfo on ampere lists VAProfileAV1) is unchanged — no progress on the AV1 enumeration goal.

What did happen instead (partial pattern precedent)

iter40 added a 3rd hardcoded fd pair to request.h for the Pi 5 HEVC decoder, NOT for ampere's AV1:

int video_fd_rkvdec;    int media_fd_rkvdec;       // iter38
int video_fd_hantro;    int media_fd_hantro;       // iter38
int video_fd_rpi_hevc_dec; int media_fd_rpi_hevc_dec;  // iter40

The iter40 commit (3ffa9d0 iter40: Pi 5 HEVC chapter — backend integration lands) shows the same shape the iter38 multi-device-probe established. So the architectural pattern works for 3 decoder kinds.

But the proposed generalization in this issue — struct decoder_slot decoders[MAX_DECODERS] array — was NOT taken. Each new decoder kind currently means hardcoding another pair.

Operator decision paths

a) Minimal-delta path (matches established iter40 precedent): add a 4th hardcoded pair video_fd_hantro_av1 + media_fd_hantro_av1, classifier in request_probe recognizing rk3588-av1-vpu-dec, dispatch in RequestCreateConfig for VAProfileAV1. ~50 LOC, mirrors iter40 exactly. Drawback: codebase carries N hardcoded pairs, will need same surgery for every future decoder kind.

b) Generalize now (this issue's original proposal): refactor to struct decoder_slot decoders[MAX_DECODERS] array, rewrite probe + dispatch as table-driven, migrate the existing 3 pairs into the array. Larger change (~200 LOC + test rewrite for iter38-baseline preservation), but caps future growth.

c) Wait for AV1 priority: AV1 source clip + Phase 4 byte-exactness criteria are also unaddressed (issue's own out-of-scope list). If nothing is actively decoding AV1 in the fleet today, this can wait — once a real consumer materializes, do (a) for the AV1 enablement plus (b) preemptively.

Recommend (a) once the fleet has a concrete AV1 use case (firefox-fourier playing YouTube AV1 streams is a candidate driver; see the daedalus-fourier README for the YouTube ∩ Pi5-HW = ∅ context, which makes ampere the right hardware for YouTube AV1). If no concrete consumer arrives within, say, a month, defer per (c).

The bigger refactor (b) is best landed concurrently with another decoder-kind addition where the operator wants to amortize the cost, not as a standalone cleanup pass.

Keeping open as waiting-on-operator-priority. No empirical reproduction needed for this one — symptom (vainfo lacks AV1) is invariant until the backend gains the 4th-fd code path.

Author
Collaborator

Headline acceptance criterion met 2026-05-18 — one-line fix.

What turned out to be the bug

ampere's av1-iter1 branch has been doing the heavy lifting for months — Phase 2 step 2 (commit 61db76e) added the VAProfileAV1Profile0 enumeration in config.c; Phase 2 step 1 (commit bed75c0) added the video_fd_vpu981 slot + AV1F-discriminating probe; Phase 2 step 4 (78a9978) added ~500 LoC AV1 dispatch scaffolding; Phase 3 (d7ef0f6) reached 3/10 frames bit-exact vs kdirect. But vainfo was still not listing the profile.

Diagnosed via strace:

  1. Probe IS working: v4l2-request: ampere-av1: vpu981 AV1 decoder at /dev/video4 + /dev/media3
  2. ENUM_FMT on the vpu981 fd correctly returns AV1F as OUTPUT_MPLANE pixfmt at index 0
  3. any_fd_supports_output_format() helper returns true
  4. The enumeration guard if (found && index < (V4L2_REQUEST_MAX_PROFILES - 1)) fails because by the time the AV1 push is reached, 10 profiles are already in profiles[] (MPEG2×2 + H264×5 + HEVC + VP8 + VP9), index = 10, MAX_PROFILES = 11, so 10 < 11 - 1 = 10 → false → AV1 silently dropped.

Off-by-one in the bounds-check pattern. Comment at the AV1 push said "MAX_PROFILES=11 is now EXACTLY full with this addition" — but the < MAX - 1 guard requires room for ANOTHER slot AFTER the push, so EXACTLY full doesn't fit through the guard.
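The arithmetic can be checked with a small standalone sketch. The function names are hypothetical; only the guard expression `index < MAX - 1` and the values 10, 11 and 12 come from the diagnosis above.

```c
#include <stdbool.h>

/* The guard shape quoted above: a push is admitted only while
 * index < max_profiles - 1, i.e. one slot is always left unused. */
bool guard_admits_push(int index, int max_profiles)
{
    return index < max_profiles - 1;
}

/* Counts how many profiles can ever be pushed through that guard. */
int usable_slots(int max_profiles)
{
    int index = 0;
    while (guard_admits_push(index, max_profiles))
        index++;
    return index;
}
```

Under this guard an array sized for N entries admits only N - 1 pushes, so enumerating 11 profiles (the 10 existing ones plus AV1) needs the limit to be at least 12.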

Fix

-#define V4L2_REQUEST_MAX_PROFILES        11
+#define V4L2_REQUEST_MAX_PROFILES        12

One-line bump in src/request.h. The other guards each take their own N-off-the-top for their push groups; their semantics are unchanged. context->max_profiles = V4L2_REQUEST_MAX_PROFILES sizes the consumer-side array automatically.

Verification 2026-05-18

$ env LIBVA_DRIVER_NAME=v4l2_request vainfo | grep VAProfile
      ... (all prior profiles unchanged) ...
      VAProfileAV1Profile0            :   VAEntrypointVLD   ✓

Acceptance criterion met (vainfo on ampere lists VAProfileAV1).

State of the fix

Committed on ampere's local av1-iter1 branch as d21feba. Ampere's git remote only has origin=marfrit/libva-v4l2-request-fourier (no claude-noether SSH alias configured there), so I haven't pushed. Options:

a) Operator picks up d21feba and merges into the in-progress av1-iter1 work directly.
b) I add a claude-noether remote on ampere + push to claude-noether's fork + open a small PR for the 1-line fix.
c) I create a fresh branch on noether's checkout, port the 1-line change there, push via claude-noether normal flow.

The rest of the av1-iter1 work (3/10 frames bit-exact, film_grain handling, dispatch scaffolding) is in flight in the operator's own iteration — issue #2's headline ask is met today but the full AV1 decode bit-exact pass is Phase 4 work that this issue's "out of scope" section calls out.

Recommend closing this issue once the operator picks an integration path for the 1-line fix.

Cross-references: VP9 enablement issue #12 closed today with the same shape (1-line any_fd_supports_output_format extension covered AV1 too as a side effect). feedback_no_bbb_intro_frames is the cross-cutting discipline that this work also benefited from.

Author
Collaborator

Closing — headline ask delivered.

PR #5 merged: master now has the vpu981 4th-fd probe + VAProfileAV1Profile0 enumeration + the defensive MAX_PROFILES bump 13 → 14. The bump follows from the off-by-one guard logic: the total possible enumeration if iter39 Option B reverts is 13, and the < MAX - 1 guards need MAX ≥ 14 to admit all of them.

vainfo on ampere lists VAProfileAV1Profile0 ✓ (verified pre-merge against the patched build; ampere's running .so then restored to the in-progress av1-iter1 build so the operator's Phase 3-5 bit-exact work isn't disrupted).

End-to-end AV1 decode bit-exact is iter4 work that the av1-iter1 operator branch continues to drive (~500 LoC av1.c dispatch + film_grain wiring + reference_frame_ts plumbing). When that lands, the merge against today's master will be small (mostly the av1.{c,h} additions + the picture.c dispatch hook).

Reopen criteria

  • AV1 fixture sourced + first browser-side VP9/AV1 stress test on firefox-fourier surfaces a libva-AV1-specific bug not covered by av1-iter1 — reopen with a fresh "[browser]" tag.
  • AV1 multi-core support requested (RK3588 has 2 vpu381 cores, mirror of the VP9 multi-core question from issue #12) — separate ticket.

Today's coverage:

  • ✅ kernel-side VP9: bit-exact via Sarma patches (issue #12)
  • ✅ HW path VP9 + AV1 enumerated in vainfo + ready for browser consumers
  • ✅ libva-side ffmpeg-vaapi NV15→P010 unpack for Hi10P/Main10 surfaces (marfrit-packages#21)
  • ⏳ libva-side bit-exact AV1 decode (av1-iter1, operator-driven)
Reference: marfrit/libva-v4l2-request-fourier#2