ohm step 4: HW decode proof via gst v4l2slh264dec — 7% CPU on Wayland
GStreamer v4l2codecs path validated on ohm: HW H.264 decode plus zero-copy dmabuf NV12 direct to waylandsink via KWin on DSI-1. 62 s playback, 7% CPU total (≈4 s of user+sys over 62 s wall), no pipeline errors. Negotiated caps: video/x-raw(memory:DMABuf), format=DMA_DRM, drm-format=NV12, width=1920, height=1080, framerate=24/1 — so the hantro-vpu decoder emits dmabufs that waylandsink imports directly. That explains the massive CPU drop vs the SW mpv baseline (127 % with 973 / 1440 frames dropped at gpu-next). Side finding: the bbb_1080p30_h264.mp4 clip on doppler is actually 24 fps per the SPS caps — the "1080p30" in the filename is Blender's encode metadata, not the real rate. Recalibrated all baseline rows against 24 fps source. SW mpv drop number revised: 973 of 1440 → 68 %, not 54 % I wrote earlier. Also added an HW throughput row (fakesink sync=false): 36.8 fps, 89 % CPU — hantro G1 on RK3566 can sustain ~1.5× realtime for 1080p H.264. Frame-drop count for the HW waylandsink run is still unknown — fpsdisplaysink signal-fps-measurements doesn't wire through gst-launch. Next pass can use GST_DEBUG=fpsdisplaysink:5 or a progressreport element to get a precise delivered-fps number. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -158,30 +158,46 @@ picks up rkvdec2 patches.
|
||||
6. **Kodi + mpv validation** — 1080p HEVC at <30% of one core target
|
||||
7. **Document** — freeze the final recipe in this README
|
||||
|
||||
**Status 2026-04-24**: ohm reachable; steps 1 + 2 + 3 done for H.264 on
|
||||
ohm; step 5 scaffold landed (`marfrit-packages/arch/ffmpeg-v4l2-request-git/`
|
||||
on `main`, commit `9994c4e`, not yet pushed — first fermi CI build + ohm
|
||||
install still to run). Immediate next: push, build, install, validate.
|
||||
**Status 2026-04-24**: ohm reachable; steps 1 + 2 + 3 + 4 done for H.264 on
|
||||
ohm (GStreamer `v4l2slh264dec → waylandsink` at 7 % CPU with zero-copy
|
||||
dmabuf NV12). Step 5 scaffold landed
|
||||
(`marfrit-packages/arch/ffmpeg-v4l2-request-git/` on `main`, commit
|
||||
`9994c4e`, unpushed — fermi CI build + ohm install still to run).
|
||||
Immediate next: push, build, install, validate the ffmpeg path.
|
||||
|
||||
### SW baseline numbers (2026-04-24, ohm, `bbb_1080p30_h264.mp4`)
|
||||
### Baseline numbers (2026-04-24, ohm, `bbb_1080p30_h264.mp4`)
|
||||
|
||||
Clip SHA-16 `dcf8a7170fbd49bb` pulled from doppler
|
||||
`/moviedata/fourier-test/` via hertz `lxc file pull` → http bridge → ohm
|
||||
(same bytes on every fleet device).
|
||||
(same bytes on every fleet device). **Source frame rate is 24 fps** per
|
||||
H.264 caps (`framerate=24/1`) — the `1080p30` in the file name is a
|
||||
misnomer carried from Blender's encode metadata; the actual media rate is
|
||||
24 fps. That's the number the target/delivered columns below are measured
|
||||
against.
|
||||
|
||||
| Test | CPU% | Frames / wall | Effective fps | Dropped |
|
||||
|----------------------------------------|-------|----------------|---------------|---------|
|
||||
| `ffmpeg -hwaccel none -f null -` | 319 % | 1440 / 18.55 s | 77.6 | n/a |
|
||||
| `ffmpeg -re -hwaccel none -f null -` | 90 % | 1440 / 60.27 s | 23.9 (paced) | n/a |
|
||||
| `mpv --hwdec=no --vo=gpu-next`, DSI-1 | 127 % | 1800 target | ~14 delivered | **973** |
|
||||
| Test | CPU% | Frames / wall | Effective fps | Dropped |
|
||||
|--------------------------------------------------------------|-------|-------------------|---------------|---------|
|
||||
| SW: `ffmpeg -hwaccel none -f null -` | 319 % | 1440 / 18.55 s | 77.6 | n/a |
|
||||
| SW: `ffmpeg -re -hwaccel none -f null -` | 90 % | 1440 / 60.27 s | 24.0 (paced) | n/a |
|
||||
| SW: `mpv --hwdec=no --vo=gpu-next`, DSI-1 | 127 % | 1440 source / 60 s| ~7.8 delivered| **973** |
|
||||
| HW: `gst v4l2slh264dec → fakesink sync=false` | 89 % | 1800 / 48.9 s | 36.8 | n/a |
|
||||
| HW: `gst v4l2slh264dec → waylandsink` (dmabuf), DSI-1 | 7 % | 1488 source / 62 s| ~24 (paced) | unknown |
|
||||
|
||||
Reading: raw SW decode has ~2.6× headroom over realtime on the 4× A55
|
||||
(≈1 core saturated per 30 fps stream). But mpv with `gpu-next` through
|
||||
KWin drops **54 %** of frames while only using 127 % CPU — the
|
||||
compositor / VO path is the bottleneck, not decode. Exactly the
|
||||
"compositor-bound ≠ decode-bound" gotcha below. HW decode will cut the
|
||||
decode cost but won't by itself fix the composition drops; `--vo=drm`
|
||||
(direct KMS scanout) or a dmabuf-capable VO is the lever for that.
|
||||
Reading:
|
||||
- SW decode alone has ~3.2× headroom over source rate (77.6 / 24 fps) but
|
||||
costs ~3.2 cores unpaced or 1 core at realtime pace.
|
||||
- SW mpv via `gpu-next` through KWin drops **68 %** of frames while only
|
||||
using 127 % CPU — the compositor / VO path is the bottleneck, not
|
||||
decode. Exactly the "compositor-bound ≠ decode-bound" gotcha below.
|
||||
- HW decode to `fakesink` clocks ~36.8 fps (~1.5× source rate), 89 % CPU.
|
||||
- **HW decode to `waylandsink` via zero-copy dmabuf (`DMA_DRM` NV12)
|
||||
drops the CPU to 7 %** — an ≈18× drop from the SW mpv number. The
|
||||
GStreamer `v4l2codecs → waylandsink` path on KWin negotiates
|
||||
dmabuf-direct, bypassing any GL upload.
|
||||
- Frame-drop count for the HW waylandsink run is unknown here
|
||||
(fpsdisplaysink signal wasn't wired through `gst-launch`); next pass
|
||||
should parse via `GST_DEBUG=fpsdisplaysink:5` or a `progressreport`
|
||||
element.
|
||||
|
||||
### Acceptance criterion
|
||||
|
||||
|
||||
Reference in New Issue
Block a user