ohm step 4: HW decode proof via gst v4l2slh264dec — 7% CPU on Wayland

GStreamer v4l2codecs path validated on ohm: HW H.264 decode plus
zero-copy dmabuf NV12 direct to waylandsink via KWin on DSI-1. 62 s
playback, 7% CPU total (≈4 s of user+sys over 62 s wall), no pipeline
errors. Negotiated caps:

  video/x-raw(memory:DMABuf), format=DMA_DRM, drm-format=NV12,
  width=1920, height=1080, framerate=24/1

— so the hantro-vpu decoder emits dmabufs that waylandsink imports
directly. That explains the massive CPU drop vs the SW mpv baseline
(127 % with 973 / 1440 frames dropped at gpu-next).

Side finding: the bbb_1080p30_h264.mp4 clip on doppler is actually 24 fps
per the SPS caps — the "1080p30" in the filename is Blender's encode
metadata, not the real rate. Recalibrated all baseline rows against
24 fps source. SW mpv drop number revised: 973 of 1440 → 68 %, not 54 %
I wrote earlier.

Also added an HW throughput row (fakesink sync=false): 36.8 fps, 89 %
CPU — hantro G1 on RK3566 can sustain ~1.5× realtime for 1080p H.264.

Frame-drop count for the HW waylandsink run is still unknown —
fpsdisplaysink signal-fps-measurements doesn't wire through gst-launch.
Next pass can use GST_DEBUG=fpsdisplaysink:5 or a progressreport element
to get a precise delivered-fps number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-24 20:18:00 +00:00
parent 1be49951a7
commit 3ef29d615e
+34 -18
View File
@@ -158,30 +158,46 @@ picks up rkvdec2 patches.
6. **Kodi + mpv validation** — 1080p HEVC at <30% of one core target
7. **Document** — freeze the final recipe in this README
**Status 2026-04-24**: ohm reachable; steps 1 + 2 + 3 done for H.264 on
ohm; step 5 scaffold landed (`marfrit-packages/arch/ffmpeg-v4l2-request-git/`
on `main`, commit `9994c4e`, not yet pushed — first fermi CI build + ohm
install still to run). Immediate next: push, build, install, validate.
**Status 2026-04-24**: ohm reachable; steps 1 + 2 + 3 + 4 done for H.264 on
ohm (GStreamer `v4l2slh264dec → waylandsink` at 7 % CPU with zero-copy
dmabuf NV12). Step 5 scaffold landed
(`marfrit-packages/arch/ffmpeg-v4l2-request-git/` on `main`, commit
`9994c4e`, unpushed — fermi CI build + ohm install still to run).
Immediate next: push, build, install, validate the ffmpeg path.
### SW baseline numbers (2026-04-24, ohm, `bbb_1080p30_h264.mp4`)
### Baseline numbers (2026-04-24, ohm, `bbb_1080p30_h264.mp4`)
Clip SHA-16 `dcf8a7170fbd49bb` pulled from doppler
`/moviedata/fourier-test/` via hertz `lxc file pull` → http bridge → ohm
(same bytes on every fleet device).
(same bytes on every fleet device). **Source frame rate is 24 fps** per
H.264 caps (`framerate=24/1`) — the `1080p30` in the file name is a
misnomer carried from Blender's encode metadata; the actual media rate is
24 fps. That's the number the target/delivered columns below are measured
against.
| Test | CPU% | Frames / wall | Effective fps | Dropped |
|----------------------------------------|-------|----------------|---------------|---------|
| `ffmpeg -hwaccel none -f null -` | 319 % | 1440 / 18.55 s | 77.6 | n/a |
| `ffmpeg -re -hwaccel none -f null -` | 90 % | 1440 / 60.27 s | 23.9 (paced) | n/a |
| `mpv --hwdec=no --vo=gpu-next`, DSI-1 | 127 % | 1800 target | ~14 delivered | **973** |
| Test | CPU% | Frames / wall | Effective fps | Dropped |
|--------------------------------------------------------------|-------|-------------------|---------------|---------|
| SW: `ffmpeg -hwaccel none -f null -` | 319 % | 1440 / 18.55 s | 77.6 | n/a |
| SW: `ffmpeg -re -hwaccel none -f null -` | 90 % | 1440 / 60.27 s | 24.0 (paced) | n/a |
| SW: `mpv --hwdec=no --vo=gpu-next`, DSI-1 | 127 % | 1440 source / 60 s| ~7.8 delivered| **973** |
| HW: `gst v4l2slh264dec → fakesink sync=false` | 89 % | 1800 / 48.9 s | 36.8 | n/a |
| HW: `gst v4l2slh264dec → waylandsink` (dmabuf), DSI-1 | 7 % | 1488 source / 62 s| ~24 (paced) | unknown |
Reading: raw SW decode has ~2.6× headroom over realtime on the 4× A55
(≈1 core saturated per 30 fps stream). But mpv with `gpu-next` through
KWin drops **54 %** of frames while only using 127 % CPU — the
compositor / VO path is the bottleneck, not decode. Exactly the
"compositor-bound ≠ decode-bound" gotcha below. HW decode will cut the
decode cost but won't by itself fix the composition drops; `--vo=drm`
(direct KMS scanout) or a dmabuf-capable VO is the lever for that.
Reading:
- SW decode alone has ~3.2× headroom over source rate (77.6 / 24 fps) but
costs ~3.2 cores unpaced or 1 core at realtime pace.
- SW mpv via `gpu-next` through KWin drops **68 %** of frames while only
using 127 % CPU — the compositor / VO path is the bottleneck, not
decode. Exactly the "compositor-bound ≠ decode-bound" gotcha below.
- HW decode to `fakesink` clocks ~36.8 fps (~1.5× source rate), 89 % CPU.
- **HW decode to `waylandsink` via zero-copy dmabuf (`DMA_DRM` NV12)
drops the CPU to 7 %** — an ≈18× drop from the SW mpv number. The
GStreamer `v4l2codecs → waylandsink` path on KWin negotiates
dmabuf-direct, bypassing any GL upload.
- Frame-drop count for the HW waylandsink run is unknown here
(fpsdisplaysink signal wasn't wired through `gst-launch`); next pass
should parse via `GST_DEBUG=fpsdisplaysink:5` or a `progressreport`
element.
### Acceptance criterion