From 2808d486fadf0fd136630b6cc5b270c95d4a9277 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <fritsche.markus@gmail.com>
Date: Fri, 24 Apr 2026 21:25:38 +0000
Subject: [PATCH] ohm step 5 validated: ffmpeg -hwaccel v4l2request + drm_prime
 = 14% CPU
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Installed marfrit/ffmpeg-v4l2-request-git on ohm (pacman -S replaces
stock ffmpeg via provides/conflicts=ffmpeg; yes|pacman past the interactive
conflict prompt). ffmpeg -hwaccels now reports v4l2request alongside
vaapi/drm.

Four timing runs on bbb_1080p30_h264.mp4:

  H1  -hwaccel v4l2request            (unpaced):    37.5 fps, 105% CPU
  H2  -re -hwaccel v4l2request         (realtime):   24.0 fps,  67% CPU
  H3  -hwaccel v4l2request -hwaccel_output_format drm_prime:
                                        (unpaced):   93.5 fps,  51% CPU
  H4  -re + drm_prime                  (realtime):   24.0 fps,  14% CPU

The drm_prime output format is decisive — 5x lower CPU at realtime,
2.5x the peak throughput. Without it, ffmpeg's v4l2request hwaccel maps
every decoded dmabuf back to a CPU-readable NV12 plane before handing
off to the muxer, which blocks the decoder pipeline and burns CPU.

With drm_prime, frames stay as dmabuf handles through the null muxer.
93.5 fps peak means the hantro VPU on RK3566 can sustain ≈95 fps of
1080p H.264 — plenty of headroom for 1080p60, and 4x over 24 fps
source rate.

GStreamer v4l2slh264dec -> waylandsink still wins on CPU (6-7%) because
its dmabuf-direct scanout via KWin->VOP2 avoids every copy. ffmpeg
drm_prime -f null - loses a few % in the null muxer layer, which would
disappear entirely in a decode->drm_prime->waylandsink filter pipeline.

Also discovered during install: DanctNIX's own repo already ships
ffmpeg-v4l2-request 2:8.1-3 (missed in initial recon — I only looked
at the installed "ffmpeg" name, not at package alternatives under
"ffmpeg-*"). The marfrit variant is still the better fit for Fourier
(stripped deps, commit-pinned, our CI control plane) and sorts
strictly newer in vercmp (2:8.1.r123329.b57fbbe-1 > 2:8.1-3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 README.md | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index bb4286f..032b200 100644
--- a/README.md
+++ b/README.md
@@ -158,14 +158,17 @@ picks up rkvdec2 patches.
 6. **Kodi + mpv validation** — 1080p HEVC at <30% of one core target
 7. **Document** — freeze the final recipe in this README
 
-**Status 2026-04-24**: ohm reachable; steps 1 + 2 + 3 + 4 done for H.264 on
-ohm (GStreamer `v4l2slh264dec → waylandsink` at 7 % CPU with zero-copy
-dmabuf NV12). Step 5 scaffold pushed; fermi CI **built + signed +
-published** `ffmpeg-v4l2-request-git-2:8.1.r123329.b57fbbe-1-aarch64` on
-first push. Immediate next: install on ohm via `pacman -Sy`, validate the
-ffmpeg v4l2-request hwaccel path, compare CPU / fps to the GStreamer
-numbers. Task (b) (Brave HW decode) blocked on multiplanar
-`libva-v4l2-request` rework — see section below.
+**Status 2026-04-24**: ohm reachable; steps 1 + 2 + 3 + 4 + 5 done for
+H.264. GStreamer `v4l2slh264dec → waylandsink` is 6–7 % CPU with
+zero-copy dmabuf NV12. FFmpeg with `-hwaccel v4l2request -hwaccel_output_format
+drm_prime` is 14 % realtime / 93.5 fps peak — hantro's 1080p H.264
+ceiling on RK3566 is ≈95 fps. marfrit `ffmpeg-v4l2-request-git` lives in
+the repo and installs cleanly via `pacman -S`. Task (b) (Brave HW decode)
+blocked on multiplanar `libva-v4l2-request` rework — see section below.
+**Note**: DanctNIX already ships `ffmpeg-v4l2-request 2:8.1-3` in its
+`danctnix` repo; the marfrit build is our stripped variant (no X11/AMF/
+CUDA/etc, Kwiboo tip pinned by commit) and sorts strictly newer in pacman
+vercmp.
 
 ### Baseline numbers (2026-04-24, ohm, `bbb_1080p30_h264.mp4`)
 
@@ -185,6 +188,10 @@ against.
 | HW: `gst v4l2slh264dec → fakesink sync=false`                |  89 % | 1800 / 48.9 s     | 36.8          | n/a     |
 | HW: `gst v4l2slh264dec → waylandsink` (dmabuf), DSI-1 1:1    |   7 % | 1488 / 62 s       | 24.0 (paced)  | 0 (progressreport 1:1) |
 | HW: `gst v4l2slh264dec → waylandsink fullscreen=true`, scaled |  6 % | 1488 / 62 s       | 24.0 (paced)  | 0 (progressreport 1:1) |
+| HW: `ffmpeg -hwaccel v4l2request -f null -`                  | 105 % | 1440 / 38.4 s     | 37.5          | n/a     |
+| HW: `ffmpeg -re -hwaccel v4l2request -f null -`              |  67 % | 1440 / 59.9 s     | 24.0 (paced)  | n/a     |
+| HW: `ffmpeg -hwaccel v4l2request -hwaccel_output_format drm_prime -f null -` | 51 % | 1440 / 15.4 s | **93.5**  | n/a |
+| HW: `ffmpeg -re + drm_prime -f null -`                       |  **14 %** | 1440 / 59.9 s | 24.0 (paced)  | n/a |
 
 Reading:
 - SW decode alone has ~3.2× headroom over source rate (77.6 / 24 fps) but
@@ -204,6 +211,13 @@ Reading:
 - Frame-drop count validated by `progressreport update-freq=5`: stream
   position advances 1:1 with wall clock for the full 62 s run — zero
   drops, full 24 fps delivery.
+- **FFmpeg path needs `-hwaccel_output_format drm_prime`** to avoid the
+  default CPU readback. Without it, `-hwaccel v4l2request` costs 67 % CPU
+  at realtime 24 fps; with it, 14 %. Peak throughput 93.5 fps vs 37.5 fps.
+  The hantro VPU's real 1080p H.264 ceiling on RK3566 is ≈95 fps — enough
+  for 1080p60 with headroom. GStreamer's `v4l2codecs → waylandsink` path
+  still wins on CPU (6–7 %) because its dmabuf-direct scanout avoids every
+  copy; ffmpeg's `null` muxer costs a few percent even with drm_prime.
 
 ### Browser HW decode (Brave / Chromium) — partially wired, library-blocked