Firefox VAAPI present path: NV12 dmabuf zero-copy not engaging; YUV→RGB lands CPU-side, stalls the daemon decode loop #5

Open
opened 2026-05-21 09:29:52 +00:00 by marfrit · 0 comments

Symptom

With the libva → kernel → daedalus_v4l2 daemon path fully alive on Pi CM5 (PR #4, daemon journal showing per-frame REQ_DECODE codec=3 ... meta=h264 with decoder: OK 1280x720 latency 3-5 ms), Firefox YouTube playback of a 30 fps 720p H.264 stream is visibly jumpy. Sustained decode rate measures ~23 fps (577 decoder: OK events in 25 s), i.e. a ~23 % source-frame drop. This should be the easy case: the documented floor is "30fps@1080p is fine".
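The figures above can be re-derived from the raw counts. A minimal sketch; the function name `decode_stats` is illustrative only and not part of any daedalus tooling:

```python
def decode_stats(ok_events: int, window_s: float, source_fps: float):
    """Return (measured fps, fraction of source frames dropped)."""
    measured_fps = ok_events / window_s
    drop_fraction = 1.0 - measured_fps / source_fps
    return measured_fps, drop_fraction


# Numbers from the report: 577 "decoder: OK" events in 25 s, 30 fps source.
fps, drop = decode_stats(577, 25.0, 30.0)
print(f"{fps:.1f} fps, {drop:.1%} dropped")
```

That both values round to 23 is a coincidence of the numbers: 23.08 fps measured and 23.07 % of frames dropped.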

What isn't the bottleneck

Ruled out on higgs (Pi CM5, kernel 6.18.29+rpt-rpi-2712) during the 2026-05-21 diagnostic session:

  • Daemon decode latency: 3-5 ms per frame, idle ~25 ms between frames.
  • Daemon CPU: 0.7 % average; single thread, never the throttle.
  • Pi: 4 cores at 2.8 GHz, 46 °C, vcgencmd get_throttled = 0x0.
  • V4L2 CAPTURE buffer count: libva-v4l2-request-fourier MIN_CAP_POOL = 24 (cap_pool.h:70); per-session log confirms cap_pool_init: 24 slots ready. rkvdec / cedrus / hantro don't enforce minimums beyond what userspace asks for, and our daedalus kernel-side queue_setup already enforces min 2 on top — so buffer pipelining headroom is fine.
  • Inter-frame REQ_DECODE arrival (kernel → daemon): ~30 ms. Daemon idle for ~25 ms of that. So the throttle is upstream of the kernel — Firefox/libva isn't submitting at 30 fps.
  • Firefox MOZ_LOG: every MediaPDecoder #N ProcessDecode line shows mDuration=33333µs (30 fps source), mTime advances in even 33333 µs steps. Firefox knows the frame rate.
  • Multi-process per-ctx vb2 lock: per-RDD MediaPDecoder #N workers all hit IsHardwareAccelerated=1 simultaneously, no EBUSY. PR #3 is doing its job under real load.

What IS the bottleneck — verified

With MOZ_LOG=Dmabuf:5,WaylandDMABuf:5,VideoBridge:5,WebRender:3, three findings:

(a) Firefox's VideoFramePool for VAAPI is empty

[RDD 10077: MediaPDecoder #4]: D/Dmabuf VideoFramePool::VideoFramePool() pool size 0

This is dom/media/platforms/ffmpeg/FFmpegVideoFramePool.cpp's zero-copy pool. With size 0, every decoded frame must be uploaded by some other (CPU-side) route — VideoFramePool is the channel that would import V4L2 CAPTURE NV12 dmabufs into Firefox's GPU process as DMABufSurfaceYUV without copying.

(b) RDD → GPU texture-construct IPC fails

###!!! [Parent][DispatchAsyncMessage] Error: PVideoBridge::Msg_PTextureConstructor Value error: message was deserialized, but contained an illegal value

The RDD process is trying to hand a video texture through the PVideoBridge IPC to the GPU/parent process, and the GPU process is rejecting it as malformed. Consistent with a failed/missing dmabuf descriptor in the texture-construct message.

(c) Compositor doesn't advertise NV12 dmabuf acceptance

Dmabuf-feedback from kwin_wayland (from MOZ_LOG Dmabuf:5):

fourcc DRM format
0x30334258 "XB30" DRM_FORMAT_XBGR2101010
0x30334241 "AB30" DRM_FORMAT_ABGR2101010
0x30335258 "XR30" DRM_FORMAT_XRGB2101010
0x30335241 "AR30" DRM_FORMAT_ARGB2101010
0x34325258 "XR24" DRM_FORMAT_XRGB8888
0x34324258 "XB24" DRM_FORMAT_XBGR8888
0x34325241 "AR24" DRM_FORMAT_ARGB8888
0x34324241 "AB24" DRM_FORMAT_ABGR8888

No NV12. The compositor advertises only RGB-family dmabuf formats. So even if Firefox's VideoFramePool worked, the daemon's NV12 CAPTURE buffers couldn't go straight to the compositor; they would need a YUV→RGB step on the way through.
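The fourcc codes in the table are just four ASCII bytes packed little-endian, so the "no NV12" observation can be checked mechanically. A minimal sketch; the helper names are illustrative, and the list is copied from the table above:

```python
def fourcc_to_str(code: int) -> str:
    """Decode a DRM fourcc integer into its four-character name."""
    return code.to_bytes(4, "little").decode("ascii")


def str_to_fourcc(name: str) -> int:
    """Encode a four-character name into a DRM fourcc integer."""
    return int.from_bytes(name.encode("ascii"), "little")


# Formats advertised in the kwin_wayland dmabuf-feedback (from the table above).
ADVERTISED = [
    0x30334258, 0x30334241, 0x30335258, 0x30335241,
    0x34325258, 0x34324258, 0x34325241, 0x34324241,
]

names = [fourcc_to_str(c) for c in ADVERTISED]
NV12 = str_to_fourcc("NV12")

print(names)
print(hex(NV12), "advertised" if NV12 in ADVERTISED else "not advertised")
```

The same check works on any dmabuf-feedback format list: paste the codes from the log into `ADVERTISED`.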

Cascade

  1. Daemon delivers NV12 frames into V4L2 CAPTURE buffers (linear, dmabuf-exportable via VIDIOC_EXPBUF).
  2. libva-v4l2-request-fourier exposes these as VASurfaceIDs via vaCreateSurfaces2 / vaExportSurfaceHandle.
  3. Firefox FFmpegVideoFramePool should enroll them as DMABufSurfaceYUV for zero-copy GL/EGL import. Instead, the pool size is 0.
  4. Firefox falls back to CPU readback of NV12 → CPU YUV→RGB conversion → GPU texture upload of RGB → compositor present.
  5. Step 4 is slow enough on Pi 5 / V3D that it back-pressures the V4L2 CAPTURE buffer release, which back-pressures the kernel→daemon REQ_DECODE flow, which drops the steady-state decode rate to ~23 fps.
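The back-pressure argument in step 5 can be put as a rough throughput model: in steady state a fully back-pressured chain runs no faster than its slowest stage. The sketch below assumes exactly that (no slack from the buffer pool) and takes the ~23 fps figure at face value. The ~43 ms it produces is an implied per-frame cost, not a measurement.

```python
def sustained_fps(source_fps: float, stage_ms: dict) -> float:
    """Steady-state rate of a back-pressured chain: capped by the slowest stage."""
    slowest_ms = max(stage_ms.values())
    return min(source_fps, 1000.0 / slowest_ms)


def implied_stage_ms(measured_fps: float) -> float:
    """Per-frame cost of the limiting stage implied by a measured rate."""
    return 1000.0 / measured_fps


budget_ms = 1000.0 / 30.0                  # per-frame budget at 30 fps
present_ms = implied_stage_ms(577 / 25.0)  # implied by the measured decode rate

# Hypothetical stage costs: 5 ms decode (upper end of the measured 3-5 ms)
# and the implied present-path cost.
rate = sustained_fps(30.0, {"decode": 5.0, "present": present_ms})
print(f"budget {budget_ms:.1f} ms, implied present path {present_ms:.1f} ms, rate {rate:.1f} fps")
```

Under this model the present path would need to drop below about 33.3 ms per frame for the decode rate to return to 30 fps.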

Where to look

This is a daedalus-adjacent companion issue, not a daemon bug. Investigation surface:

  • firefox-fourier (mozilla-central) dom/media/platforms/ffmpeg/FFmpegVideoFramePool.cpp — why does VAAPI surface enrollment produce a 0-size pool? Are we hitting an early-bailout path on a vaExportSurfaceHandle failure, or a format-modifier mismatch (V3D may not advertise NV12 as a sampler-importable format on the EGL_EXT_image_dma_buf_import side)?
  • libva-v4l2-request-fourier vaExportSurfaceHandle — does the descriptor it produces from a V4L2 CAPTURE buffer have the modifiers Firefox/Mesa-V3D expect (DRM_FORMAT_MOD_LINEAR for V4L2_PIX_FMT_NV12)?
  • Mesa V3D — does EGL_EXT_image_dma_buf_import accept NV12 modifier=LINEAR with two-plane layout? Some V3D builds historically rejected NV12 dmabuf.
  • kwin_wayland — surface a NV12 dmabuf-feedback tranche so direct-scanout / overlay-plane path could skip the GL conversion entirely. (Stretch; depends on kwin support level on Pi.)
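For the descriptor questions above, it helps to write down what a well-formed linear NV12 export should look like: two planes, a full-resolution Y plane followed by an interleaved UV plane at half height. The sketch below is a hypothetical, simplified stand-in for the descriptor that vaExportSurfaceHandle fills in; its field names are illustrative and do not match the real libva struct. It also assumes an even frame width.

```python
from dataclasses import dataclass
from typing import List

DRM_FORMAT_NV12 = 0x3231564E   # fourcc "NV12"
DRM_FORMAT_MOD_LINEAR = 0


@dataclass
class Plane:
    offset: int   # byte offset of the plane inside the dmabuf
    pitch: int    # bytes per row


@dataclass
class Descriptor:  # hypothetical, simplified view of an exported surface
    fourcc: int
    modifier: int
    width: int
    height: int
    size: int      # total dmabuf size in bytes
    planes: List[Plane]


def check_nv12_linear(d: Descriptor) -> List[str]:
    """Return a list of problems; an empty list means the layout looks sane."""
    problems = []
    if d.fourcc != DRM_FORMAT_NV12:
        problems.append("fourcc is not NV12")
    if d.modifier != DRM_FORMAT_MOD_LINEAR:
        problems.append("modifier is not LINEAR")
    if len(d.planes) != 2:
        problems.append("expected 2 planes, got %d" % len(d.planes))
        return problems
    y, uv = d.planes
    # Y: 1 byte per pixel. UV: width/2 samples of 2 bytes, so also >= width bytes.
    if y.pitch < d.width or uv.pitch < d.width:
        problems.append("pitch smaller than row width")
    if uv.offset < y.offset + y.pitch * d.height:
        problems.append("UV plane overlaps Y plane")
    chroma_rows = (d.height + 1) // 2
    if d.size < uv.offset + uv.pitch * chroma_rows:
        problems.append("buffer too small for both planes")
    return problems


# A tightly packed 1280x720 frame (pitch == width is an assumption).
good = Descriptor(DRM_FORMAT_NV12, DRM_FORMAT_MOD_LINEAR, 1280, 720,
                  1280 * 720 * 3 // 2, [Plane(0, 1280), Plane(1280 * 720, 1280)])
print(check_nv12_linear(good))
```

Dumping the real descriptor fields in the driver and comparing them against these invariants would show whether the export itself is malformed, before looking at Mesa or Firefox.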

Architectural note

The right fix is to make the YUV→RGB step land on the V3D (VC VII) GPU as a zero-copy chain: daemon NV12 dmabuf → EGLImage → GL YUV-sampling shader → compositor. The fix is not to optimise the CPU-side fallback path; CPU-NEON improvements inside the daemon would be "convoluted pipeline doing nothing" (mfritsche 2026-05-21).
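For reference, this is the per-pixel work a YUV-sampling shader performs, written out on the CPU purely as an illustration of the arithmetic. The sketch assumes BT.709 limited-range coefficients and a linear two-plane NV12 layout; the colour matrix the stream actually signals is not stated in this report, and both function names are hypothetical.

```python
def sample_nv12(buf: bytes, height: int, pitch: int, x: int, y: int):
    """Fetch (Y, U, V) for pixel (x, y) from a linear NV12 buffer.

    Assumes the UV plane starts right after the Y plane and shares its pitch.
    """
    uv_offset = pitch * height
    luma = buf[y * pitch + x]
    # One interleaved U,V pair covers a 2x2 block of luma samples.
    uv_index = uv_offset + (y // 2) * pitch + (x // 2) * 2
    return luma, buf[uv_index], buf[uv_index + 1]


def yuv_to_rgb_bt709_limited(y: int, u: int, v: int):
    """Convert one limited-range BT.709 YUV sample to 8-bit RGB."""
    c = 1.1644 * (y - 16)
    d = u - 128
    e = v - 128

    def clamp(value: float) -> int:
        return max(0, min(255, int(round(value))))

    return (clamp(c + 1.7927 * e),
            clamp(c - 0.2132 * d - 0.5329 * e),
            clamp(c + 2.1124 * d))


# A 2x2 frame: four luma bytes followed by one U,V pair (neutral chroma).
frame = bytes([16, 235, 16, 235, 128, 128])
print(yuv_to_rgb_bt709_limited(*sample_nv12(frame, 2, 2, 1, 1)))
```

In the zero-copy chain the same arithmetic runs per fragment on the GPU, and the NV12 buffer is never read by the CPU.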

Repro

MOZ_ENABLE_WAYLAND=1 MOZ_VA_API_ENABLED=1 LIBVA_DRIVER_NAME=v4l2_request \
MOZ_LOG=Dmabuf:5,WaylandDMABuf:5,VideoBridge:5 \
firefox 'https://www.youtube.com/watch?v=gtLNQX_2NQc'

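Capture Firefox's log output to ff.log (for example by appending 2> ff.log to the command above; MOZ_LOG writes to stderr by default), then grep it for VideoFramePool / PVideoBridge::Msg_PTextureConstructor. Watch journalctl -u daedalus-v4l2 for inter-REQ_DECODE gaps > 20 ms during steady-state playback.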
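Spotting the gaps by eye in the journal is tedious. Here is a minimal sketch of the gap check, assuming the REQ_DECODE timestamps have already been extracted into a list of seconds. Parsing is left out because the journal line format is not shown here, and `long_gaps` is an illustrative name.

```python
def long_gaps(timestamps_s, threshold_ms=20.0):
    """Return (start_time_s, gap_ms) for every gap longer than threshold_ms."""
    gaps = []
    for prev, cur in zip(timestamps_s, timestamps_s[1:]):
        gap_ms = (cur - prev) * 1000.0
        if gap_ms > threshold_ms:
            gaps.append((prev, gap_ms))
    return gaps


# Made-up timestamps for illustration only, not taken from a real journal.
example = [0.000, 0.010, 0.045, 0.055]
for start, gap in long_gaps(example):
    print(f"gap of {gap:.1f} ms after t={start:.3f} s")
```

The threshold is a parameter so the same check can be rerun with a different cut-off.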


Reference: reauktion/daedalus-v4l2#5