Files
daedalus-v4l2/docs/perf_comparison.md
marfrit 1d0db3b5a9 docs: pure ffmpeg vs daedalus pipeline CPU comparison
Measured on hertz (Pi 5, 6.12.75+rpt-rpi-2712, FFmpeg 7.1.3)
to quantify the architectural cost/benefit of routing decode
through the V4L2 m2m + chardev + dmabuf path vs running
ffmpeg standalone.

1080p × 150 frames, decode-as-fast-as-possible:

  VP9 8-bit:     ffmpeg 214.9% CPU / 1083ms wall
                 daedalus 96.3% CPU / 1229ms wall
  AV1 8-bit:     ffmpeg 201.5% CPU / 1162ms wall
                 daedalus 96.6% CPU / 1478ms wall
  H.264 8-bit:   ffmpeg 205.8% CPU / 1063ms wall
                 daedalus 100.1% CPU / 1020ms wall
  VP9 10-bit:    ffmpeg 155.8% CPU /  269ms wall
                 daedalus 91.6% CPU /  131ms wall

Key takeaway: the daedalus pipeline uses ~half the CPU for
roughly the same wall throughput. FFmpeg standalone defaults
to 2 threads; for single-stream decode that doesn't
parallelise well, so the 2× CPU usage is overhead, not
parallelism benefit. The daemon's single-threaded serialised
event loop avoids that tax.

For the project's 30fps-floor-is-fine target ("daily YouTube
with CPU free for vscode"), daedalus leaves ~2× the CPU
headroom for the rest of the desktop at the same playback
rate.

VP9-10bit is striking — daedalus is faster wallclock too
(131ms vs 269ms) because at small per-frame work FFmpeg's
thread pool spin-up dominates.

Note: "daedalus" still uses FFmpeg internally (Phase 8.8
explicitly deferred QPU substitution after measurement showed
30fps@1080p was already met). The benefit here is
architectural — single-threaded decode, out-of-process
daemon, dmabuf zero-copy — not QPU offload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 17:20:22 +00:00

166 lines
6.4 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Performance comparison: pure FFmpeg vs daedalus pipeline
**Date:** 2026-05-18, measured on hertz (Raspberry Pi 5,
kernel 6.12.75+rpt-rpi-2712, FFmpeg 7.1.3).
This compares **CPU load and wall time** for two decode paths
of the same 1080p bitstream:
1. **Pure FFmpeg software**
`ffmpeg -i in.ivf -pix_fmt nv12 -f rawvideo -y /dev/null`.
Uses FFmpeg's default thread count (2 on this Pi 5).
2. **Daedalus pipeline**`tools/test_m2m_stream` feeds each
frame through `/dev/video0` → daedalus_v4l2 kernel module
→ /dev/daedalus-v4l2 chardev → daedalus daemon → FFmpeg
software decode (single-threaded) → dmabuf → CAPTURE
buffer → `/dev/null`.
Both paths use the same FFmpeg library underneath; the
"daedalus" path adds the V4L2 m2m + chardev + dmabuf
marshalling overhead but constrains decode to a single
thread (the daemon's serialised event loop). No QPU
substitution is involved — that was explicitly deferred
in Phase 8.8 once measurement showed the
30fps@1080p target was already met.
## Methodology
`tools/cpu_bench.py` (under `/tmp/` on hertz):
- For each codec, decode a 150-frame 1080p `testsrc`
stream once (decode-as-fast-as-possible, no rate limit).
- Measure wall ns via `time.monotonic_ns()`.
- Measure command CPU via `os.wait4` rusage
(`ru_utime + ru_stime`).
- Measure daemon CPU separately by diffing
`/proc/<daemon-pid>/stat` fields 14 + 15 (utime + stime)
before and after.
- Total CPU = command + daemon (the latter is zero for
the pure-ffmpeg path).
- Load = total_cpu / wall × 100 % (so 100 % = one core
busy for the full wall duration; 200 % = two cores).
## Results
### 1080p × 150 frames, decode-as-fast-as-possible
| Codec | Path | Wall | CPU time | Load | Cores |
|-------|------|------|----------|------|-------|
| **VP9 8-bit** | pure ffmpeg | 1083 ms | 2327 ms | **214.9 %** | ~2.1 |
| | daedalus | 1229 ms | 1183 ms | **96.3 %** | ~1.0 |
| **AV1 8-bit** | pure ffmpeg | 1162 ms | 2342 ms | **201.5 %** | ~2.0 |
| | daedalus | 1478 ms | 1428 ms | **96.6 %** | ~1.0 |
| **H.264 8-bit** | pure ffmpeg | 1063 ms | 2188 ms | **205.8 %** | ~2.1 |
| | daedalus | 1020 ms | 1021 ms | **100.1 %** | ~1.0 |
| **VP9 10-bit (P010)** | pure ffmpeg | 269 ms | 419 ms | **155.8 %** | ~1.6 |
| | daedalus | 131 ms | 120 ms | **91.6 %** | ~1.0 |
## Interpretation
### Daedalus uses ~half the CPU for the same throughput
FFmpeg standalone defaults to `-threads 0` → 2 threads on
this 4-core Pi 5. Single-stream decode doesn't parallelise
well across threads — the wall time barely changes vs
single-thread (FFmpeg's frame-level threading helps less
than slice-level threading, and is mostly waiting on
ref-frame dependencies). Result: 2× the CPU usage for
roughly the same throughput.
The daedalus path's daemon is **single-threaded by design**.
It serialises REQ_DECODE handling through one chardev I/O
loop, calling FFmpeg's decode with no thread fan-out. That
costs us ~one core but no more.
### Wall time is comparable
H.264 is actually faster wallclock through daedalus
(1020 ms vs 1063 ms). VP9 and AV1 are 1030 % slower
wallclock — the cost of dmabuf-fd allocation per frame +
chardev round-trip + the daemon's NV12 pack loop. This is
the architectural overhead measurement we wanted.
For VP9-10bit the daedalus path is *much* faster wallclock
(131 ms vs 269 ms) — at small per-frame work the FFmpeg
thread-pool spin-up overhead dominates.
### At sustained 30fps playback
Both paths fit comfortably inside a 33 ms-per-frame budget
(largest per-frame wall ≈ 9.9 ms for daedalus AV1). CPU
load *during decode bursts*:
| Path | CPU during a decode | CPU averaged over 30fps |
|------|---------------------|-------------------------|
| Pure ffmpeg | 200 % (across 2 cores) | ~12 % of system (4-core) |
| Daedalus | 100 % (1 core) | ~6 % of system (4-core) |
At the project's user-facing target ("daily YouTube
playback with CPU free for vscode" — see the
`30fps-floor-is-fine` memory), daedalus leaves roughly
2× the CPU headroom for the rest of the desktop.
### Why isn't this "QPU offloading"?
Phase 8.8 explicitly **didn't** wire QPU dispatch into the
decode path. Measurement showed the daemon's FFmpeg
software path already exceeded the 30fps@1080p target by
2-3× — QPU substitution would have been premature
optimisation.
The "daedalus pipeline" benefit here comes from
**architectural choices**, not QPU offload:
1. **Single-threaded decode**: avoids FFmpeg's
multi-thread coordination tax on small per-stream
workloads.
2. **Out-of-process daemon**: decoding doesn't share
address space with the V4L2 client; CPU spikes don't
compete for the client's working set.
3. **dmabuf zero-copy**: decoded pixels land directly in
the client's CAPTURE buffer — no per-frame memcpy at
the V4L2 boundary.
The QPU substitution work stays on the roadmap (Phase 8.10+)
for higher-resolution / higher-fps / lower-power workloads
where the FFmpeg software path doesn't have enough headroom.
## Caveats
- Tests are decode-as-fast-as-possible, not real-time
playback. Real consumers would pace to display
refresh rate and the CPU-saving advantage would
amplify (idle gaps between frames are pure savings).
- Daemon's `daemon_cpu_ms` includes its own FFmpeg
decode plus chardev I/O plus NV12 pack — they're
not separable from `/proc/PID/stat`.
- 4-core Pi 5; results scale differently on 1- or
2-core machines (the multi-thread overhead becomes
proportionally worse).
- All tests use the same `testsrc` 1080p source; real
content (high-motion, large GOPs) would shift the
per-frame µs but the architectural ratio should hold.
## Reproduce
```sh
# On hertz, with the kernel module loaded and the daemon running:
python3 /tmp/cpu_bench.py
```
`/tmp/cpu_bench.py` (the harness used here) reads
`/proc/<daemon-pid>/stat` for the daemon-side CPU and uses
`os.wait4` rusage for the client-side CPU. Inputs:
`/tmp/{vp9,av1,h264}_5s.ivf` (1080p 30fps 150 frames each),
generated by:
```sh
ffmpeg -f lavfi -i 'testsrc=duration=5:size=1920x1080:rate=30' \
-pix_fmt yuv420p -c:v libvpx-vp9 -cpu-used 8 -y vp9_5s.ivf
ffmpeg -f lavfi -i 'testsrc=duration=5:size=1920x1080:rate=30' \
-pix_fmt yuv420p -c:v libsvtav1 -preset 12 -y av1_5s.ivf
ffmpeg -f lavfi -i 'testsrc=duration=5:size=1920x1080:rate=30' \
-pix_fmt yuv420p -c:v libx264 -preset ultrafast \
-profile:v baseline -y h264_5s.h264
```