ohm step 4: HW decode proof via gst v4l2slh264dec — 7% CPU on Wayland

GStreamer v4l2codecs path validated on ohm: HW H.264 decode plus zero-copy dmabuf NV12 direct to waylandsink via KWin on DSI-1. 62 s playback, 7% CPU total (≈4 s of user+sys over 62 s wall), no pipeline errors.

Negotiated caps: video/x-raw(memory:DMABuf), format=DMA_DRM, drm-format=NV12, width=1920, height=1080, framerate=24/1. So the hantro-vpu decoder emits dmabufs that waylandsink imports directly. That explains the massive CPU drop vs the SW mpv baseline (127 % with 973 / 1440 frames dropped at gpu-next).

Side finding: the bbb_1080p30_h264.mp4 clip on doppler is actually 24 fps per the SPS caps. The "1080p30" in the filename is Blender's encode metadata, not the real rate. Recalibrated all baseline rows against the 24 fps source. SW mpv drop number revised: 973 of 1440 → 68 %, not the 54 % I wrote earlier.

Also added an HW throughput row (fakesink sync=false): 36.8 fps, 89 % CPU. Hantro G1 on RK3566 can sustain ~1.5× realtime for 1080p H.264.

Frame-drop count for the HW waylandsink run is still unknown: fpsdisplaysink signal-fps-measurements doesn't wire through gst-launch. Next pass can use GST_DEBUG=fpsdisplaysink:5 or a progressreport element to get a precise delivered-fps number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
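The derived figures quoted in this commit (drop percentage, effective fps, headroom, CPU share) can be re-checked with a few lines of arithmetic. What follows is a minimal sketch in Python, not something that exists in the repo: the helper names are hypothetical, and the inputs are simply the measurements stated in the message and the README table.

```python
# Sanity-check sketch for the derived numbers. Helper names are
# illustrative assumptions; inputs are the measurements from this commit.

SOURCE_FPS = 24.0  # from the negotiated caps, framerate=24/1


def drop_pct(dropped: int, total: int) -> float:
    """Share of source frames that were dropped, in percent."""
    return 100.0 * dropped / total


def effective_fps(frames: int, wall_s: float) -> float:
    """Frames processed per second of wall-clock time."""
    return frames / wall_s


def delivered_fps(total: int, dropped: int, wall_s: float) -> float:
    """Frames that actually reached the display per second."""
    return (total - dropped) / wall_s


def cpu_pct(cpu_s: float, wall_s: float) -> float:
    """CPU usage as user+sys seconds over wall seconds, in percent."""
    return 100.0 * cpu_s / wall_s


if __name__ == "__main__":
    # SW mpv via gpu-next: 973 of 1440 frames dropped over ~60 s.
    print(f"mpv drop rate:      {drop_pct(973, 1440):.1f} %")
    print(f"mpv delivered fps:  {delivered_fps(1440, 973, 60.0):.1f}")
    # Unpaced and paced SW ffmpeg decode.
    print(f"ffmpeg unpaced fps: {effective_fps(1440, 18.55):.1f}")
    print(f"ffmpeg paced fps:   {effective_fps(1440, 60.27):.1f}")
    # Headroom relative to the 24 fps source.
    print(f"SW headroom:        {77.6 / SOURCE_FPS:.2f}x")
    print(f"HW headroom:        {36.8 / SOURCE_FPS:.2f}x")
    # HW waylandsink run: about 4 s of user+sys over 62 s wall.
    print(f"HW waylandsink CPU: {cpu_pct(4.0, 62.0):.1f} %")
    # CPU ratio between SW mpv (127 %) and HW waylandsink (7 %).
    print(f"CPU reduction:      {127 / 7:.1f}x")
```

Running it reproduces the 68 % drop rate, the ~7.8 fps delivered, and the ≈18× CPU reduction, which makes it easy to redo the table if a later run changes the frame counts or wall times.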
2026-04-24 20:18:00 +00:00
parent 1be49951a7
commit 3ef29d615e
1 changed files with 34 additions and 18 deletions
@@ -158,30 +158,46 @@ picks up rkvdec2 patches.
 6. **Kodi + mpv validation** — 1080p HEVC at <30% of one core target
 7. **Document** — freeze the final recipe in this README

-**Status 2026-04-24**: ohm reachable; steps 1 + 2 + 3 done for H.264 on
-ohm; step 5 scaffold landed (`marfrit-packages/arch/ffmpeg-v4l2-request-git/`
-on `main`, commit `9994c4e`, not yet pushed — first fermi CI build + ohm
-install still to run). Immediate next: push, build, install, validate.
+**Status 2026-04-24**: ohm reachable; steps 1 + 2 + 3 + 4 done for H.264 on
+ohm (GStreamer `v4l2slh264dec → waylandsink` at 7 % CPU with zero-copy
+dmabuf NV12). Step 5 scaffold landed
+(`marfrit-packages/arch/ffmpeg-v4l2-request-git/` on `main`, commit
+`9994c4e`, unpushed — fermi CI build + ohm install still to run).
+Immediate next: push, build, install, validate the ffmpeg path.

-### SW baseline numbers (2026-04-24, ohm, `bbb_1080p30_h264.mp4`)
+### Baseline numbers (2026-04-24, ohm, `bbb_1080p30_h264.mp4`)

 Clip SHA-16 `dcf8a7170fbd49bb` pulled from doppler
 `/moviedata/fourier-test/` via hertz `lxc file pull` → http bridge → ohm
-(same bytes on every fleet device).
+(same bytes on every fleet device). **Source frame rate is 24 fps** per
+the negotiated H.264 caps (`framerate=24/1`); the `1080p30` in the file
+name is a misnomer carried over from Blender's encode metadata. The
+frame counts and fps figures in the table below are measured against
+24 fps.

-| Test                                   | CPU%  | Frames / wall  | Effective fps | Dropped |
-|----------------------------------------|-------|----------------|---------------|---------|
-| `ffmpeg -hwaccel none -f null -`       | 319 % | 1440 / 18.55 s | 77.6          | n/a     |
-| `ffmpeg -re -hwaccel none -f null -`   |  90 % | 1440 / 60.27 s | 23.9 (paced)  | n/a     |
-| `mpv --hwdec=no --vo=gpu-next`, DSI-1  | 127 % | 1800 target    | ~14 delivered | **973** |
+| Test                                                         | CPU%  | Frames / wall     | Effective fps | Dropped |
+|--------------------------------------------------------------|-------|-------------------|---------------|---------|
+| SW: `ffmpeg -hwaccel none -f null -`                         | 319 % | 1440 / 18.55 s    | 77.6          | n/a     |
+| SW: `ffmpeg -re -hwaccel none -f null -`                     |  90 % | 1440 / 60.27 s    | 23.9 (paced)  | n/a     |
+| SW: `mpv --hwdec=no --vo=gpu-next`, DSI-1                    | 127 % | 1440 source / 60 s| ~7.8 delivered| **973** |
+| HW: `gst v4l2slh264dec → fakesink sync=false`                |  89 % | 1800 / 48.9 s     | 36.8          | n/a     |
+| HW: `gst v4l2slh264dec → waylandsink` (dmabuf), DSI-1        |   7 % | 1488 source / 62 s| ~24 (paced)   | unknown |

-Reading: raw SW decode has ~2.6× headroom over realtime on the 4× A55
-(≈1 core saturated per 30 fps stream). But mpv with `gpu-next` through
-KWin drops **54 %** of frames while only using 127 % CPU — the
-compositor / VO path is the bottleneck, not decode. Exactly the
-"compositor-bound ≠ decode-bound" gotcha below. HW decode will cut the
-decode cost but won't by itself fix the composition drops; `--vo=drm`
-(direct KMS scanout) or a dmabuf-capable VO is the lever for that.
+Reading:
+- SW decode alone has ~3.2× headroom over source rate (77.6 / 24 fps) but
+  costs ~3.2 cores unpaced or 1 core at realtime pace.
+- SW mpv via `gpu-next` through KWin drops **68 %** of frames while only
+  using 127 % CPU — the compositor / VO path is the bottleneck, not
+  decode. Exactly the "compositor-bound ≠ decode-bound" gotcha below.
+- HW decode to `fakesink` clocks ~36.8 fps (~1.5× source rate), 89 % CPU.
+- **HW decode to `waylandsink` via zero-copy dmabuf (`DMA_DRM` NV12)
+  drops the CPU to 7 %** — an ≈18× drop from the SW mpv number. The
+  GStreamer `v4l2codecs → waylandsink` path on KWin negotiates
+  dmabuf-direct, bypassing any GL upload.
+- Frame-drop count for the HW waylandsink run is unknown here: the
+  `fpsdisplaysink` `signal-fps-measurements` output isn't surfaced by
+  `gst-launch`. The next pass should read the delivered fps from
+  `GST_DEBUG=fpsdisplaysink:5` logs or a `progressreport` element.

 ### Acceptance criterion