# Performance comparison: pure FFmpeg vs daedalus pipeline **Date:** 2026-05-18, measured on hertz (Raspberry Pi 5, kernel 6.12.75+rpt-rpi-2712, FFmpeg 7.1.3). This compares **CPU load and wall time** for two decode paths of the same 1080p bitstream: 1. **Pure FFmpeg software** — `ffmpeg -i in.ivf -pix_fmt nv12 -f rawvideo -y /dev/null`. Uses FFmpeg's default thread count (2 on this Pi 5). 2. **Daedalus pipeline** — `tools/test_m2m_stream` feeds each frame through `/dev/video0` → daedalus_v4l2 kernel module → /dev/daedalus-v4l2 chardev → daedalus daemon → FFmpeg software decode (single-threaded) → dmabuf → CAPTURE buffer → `/dev/null`. Both paths use the same FFmpeg library underneath; the "daedalus" path adds the V4L2 m2m + chardev + dmabuf marshalling overhead but constrains decode to a single thread (the daemon's serialised event loop). No QPU substitution is involved — that was explicitly deferred in Phase 8.8 once measurement showed the 30fps@1080p target was already met. ## Methodology `tools/cpu_bench.py` (under `/tmp/` on hertz): - For each codec, decode a 150-frame 1080p `testsrc` stream once (decode-as-fast-as-possible, no rate limit). - Measure wall ns via `time.monotonic_ns()`. - Measure command CPU via `os.wait4` rusage (`ru_utime + ru_stime`). - Measure daemon CPU separately by diffing `/proc//stat` fields 14 + 15 (utime + stime) before and after. - Total CPU = command + daemon (the latter is zero for the pure-ffmpeg path). - Load = total_cpu / wall × 100 % (so 100 % = one core busy for the full wall duration; 200 % = two cores). ## Results ### 1080p × 150 frames, decode-as-fast-as-possible | Codec | Path | Wall | CPU time | Load | Cores | |-------|------|------|----------|------|-------| | **VP9 8-bit** | pure ffmpeg | 1083 ms | 2327 ms | **214.9 %** | ~2.1 | | | daedalus | 1229 ms | 1183 ms | **96.3 %** | ~1.0 | | **AV1 8-bit** | pure ffmpeg | 1162 ms | 2342 ms | **201.5 %** | ~2.0 | | | daedalus | 1478 ms | 1428 ms | **96.6 %** | ~1.0 | | **H.264 8-bit** | pure ffmpeg | 1063 ms | 2188 ms | **205.8 %** | ~2.1 | | | daedalus | 1020 ms | 1021 ms | **100.1 %** | ~1.0 | | **VP9 10-bit (P010)** | pure ffmpeg | 269 ms | 419 ms | **155.8 %** | ~1.6 | | | daedalus | 131 ms | 120 ms | **91.6 %** | ~1.0 | ## Interpretation ### Daedalus uses ~half the CPU for the same throughput FFmpeg standalone defaults to `-threads 0` → 2 threads on this 4-core Pi 5. Single-stream decode doesn't parallelise well across threads — the wall time barely changes vs single-thread (FFmpeg's frame-level threading helps less than slice-level threading, and is mostly waiting on ref-frame dependencies). Result: 2× the CPU usage for roughly the same throughput. The daedalus path's daemon is **single-threaded by design**. It serialises REQ_DECODE handling through one chardev I/O loop, calling FFmpeg's decode with no thread fan-out. That costs us ~one core but no more. ### Wall time is comparable H.264 is actually faster wallclock through daedalus (1020 ms vs 1063 ms). VP9 and AV1 are 10–30 % slower wallclock — the cost of dmabuf-fd allocation per frame + chardev round-trip + the daemon's NV12 pack loop. This is the architectural overhead measurement we wanted. For VP9-10bit the daedalus path is *much* faster wallclock (131 ms vs 269 ms) — at small per-frame work the FFmpeg thread-pool spin-up overhead dominates. ### At sustained 30fps playback Both paths fit comfortably inside a 33 ms-per-frame budget (largest per-frame wall ≈ 9.9 ms for daedalus AV1). CPU load *during decode bursts*: | Path | CPU during a decode | CPU averaged over 30fps | |------|---------------------|-------------------------| | Pure ffmpeg | 200 % (across 2 cores) | ~12 % of system (4-core) | | Daedalus | 100 % (1 core) | ~6 % of system (4-core) | At the project's user-facing target ("daily YouTube playback with CPU free for vscode" — see the `30fps-floor-is-fine` memory), daedalus leaves roughly 2× the CPU headroom for the rest of the desktop. ### Why isn't this "QPU offloading"? Phase 8.8 explicitly **didn't** wire QPU dispatch into the decode path. Measurement showed the daemon's FFmpeg software path already exceeded the 30fps@1080p target by 2-3× — QPU substitution would have been premature optimisation. The "daedalus pipeline" benefit here comes from **architectural choices**, not QPU offload: 1. **Single-threaded decode**: avoids FFmpeg's multi-thread coordination tax on small per-stream workloads. 2. **Out-of-process daemon**: decoding doesn't share address space with the V4L2 client; CPU spikes don't compete for the client's working set. 3. **dmabuf zero-copy**: decoded pixels land directly in the client's CAPTURE buffer — no per-frame memcpy at the V4L2 boundary. The QPU substitution work stays on the roadmap (Phase 8.10+) for higher-resolution / higher-fps / lower-power workloads where the FFmpeg software path doesn't have enough headroom. ## Caveats - Tests are decode-as-fast-as-possible, not real-time playback. Real consumers would pace to display refresh rate and the CPU-saving advantage would amplify (idle gaps between frames are pure savings). - Daemon's `daemon_cpu_ms` includes its own FFmpeg decode plus chardev I/O plus NV12 pack — they're not separable from `/proc/PID/stat`. - 4-core Pi 5; results scale differently on 1- or 2-core machines (the multi-thread overhead becomes proportionally worse). - All tests use the same `testsrc` 1080p source; real content (high-motion, large GOPs) would shift the per-frame µs but the architectural ratio should hold. ## Reproduce ```sh # On hertz, with the kernel module loaded and the daemon running: python3 /tmp/cpu_bench.py ``` `/tmp/cpu_bench.py` (the harness used here) reads `/proc//stat` for the daemon-side CPU and uses `os.wait4` rusage for the client-side CPU. Inputs: `/tmp/{vp9,av1,h264}_5s.ivf` (1080p 30fps 150 frames each), generated by: ```sh ffmpeg -f lavfi -i 'testsrc=duration=5:size=1920x1080:rate=30' \ -pix_fmt yuv420p -c:v libvpx-vp9 -cpu-used 8 -y vp9_5s.ivf ffmpeg -f lavfi -i 'testsrc=duration=5:size=1920x1080:rate=30' \ -pix_fmt yuv420p -c:v libsvtav1 -preset 12 -y av1_5s.ivf ffmpeg -f lavfi -i 'testsrc=duration=5:size=1920x1080:rate=30' \ -pix_fmt yuv420p -c:v libx264 -preset ultrafast \ -profile:v baseline -y h264_5s.h264 ```