diff --git a/docs/perf_comparison.md b/docs/perf_comparison.md
new file mode 100644
index 0000000..f9f3aa5
--- /dev/null
+++ b/docs/perf_comparison.md
@@ -0,0 +1,165 @@
+# Performance comparison: pure FFmpeg vs daedalus pipeline
+
+**Date:** 2026-05-18, measured on hertz (Raspberry Pi 5,
+kernel 6.12.75+rpt-rpi-2712, FFmpeg 7.1.3).
+
+This compares **CPU load and wall time** for two decode paths
+of the same 1080p bitstream:
+
+1. **Pure FFmpeg software** —
+   `ffmpeg -i in.ivf -pix_fmt nv12 -f rawvideo -y /dev/null`.
+   Uses FFmpeg's default thread count (2 on this Pi 5).
+2. **Daedalus pipeline** — `tools/test_m2m_stream` feeds each
+   frame through `/dev/video0` → daedalus_v4l2 kernel module
+   → /dev/daedalus-v4l2 chardev → daedalus daemon → FFmpeg
+   software decode (single-threaded) → dmabuf → CAPTURE
+   buffer → `/dev/null`.
+
+Both paths use the same FFmpeg library underneath; the
+"daedalus" path adds the V4L2 m2m + chardev + dmabuf
+marshalling overhead but constrains decode to a single
+thread (the daemon's serialised event loop).  No QPU
+substitution is involved — that was explicitly deferred
+in Phase 8.8 once measurement showed the
+30fps@1080p target was already met.
+
+## Methodology
+
+`tools/cpu_bench.py` (under `/tmp/` on hertz):
+
+- For each codec, decode a 150-frame 1080p `testsrc`
+  stream once (decode-as-fast-as-possible, no rate limit).
+- Measure wall ns via `time.monotonic_ns()`.
+- Measure command CPU via `os.wait4` rusage
+  (`ru_utime + ru_stime`).
+- Measure daemon CPU separately by diffing
+  `/proc/<daemon-pid>/stat` fields 14 + 15 (utime + stime)
+  before and after.
+- Total CPU = command + daemon (the latter is zero for
+  the pure-ffmpeg path).
+- Load = total_cpu / wall × 100 % (so 100 % = one core
+  busy for the full wall duration; 200 % = two cores).
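+
+The steps above can be sketched as follows.  This is a minimal
+illustration, not the actual `cpu_bench.py`: the helper names
+(`proc_cpu_ms`, `load_percent`, `run_and_measure`) are
+hypothetical, and it assumes a Linux `/proc` and a known
+daemon PID.
+
+```python
+import os
+import time
+
+
+def proc_cpu_ms(pid):
+    """Return utime + stime (fields 14 and 15 of /proc/<pid>/stat) in ms."""
+    with open(f"/proc/{pid}/stat") as f:
+        stat = f.read()
+    # Field 2 (comm) may contain spaces, so split after its closing paren.
+    rest = stat[stat.rfind(")") + 2:].split()
+    # rest[0] is field 3, so fields 14 and 15 are rest[11] and rest[12].
+    ticks = int(rest[11]) + int(rest[12])
+    return ticks * 1000.0 / os.sysconf("SC_CLK_TCK")
+
+
+def load_percent(total_cpu_ms, wall_ms):
+    """100 % means one core busy for the whole wall duration."""
+    return total_cpu_ms / wall_ms * 100.0
+
+
+def run_and_measure(argv, daemon_pid=None):
+    """Run argv to completion; return wall, command and daemon CPU in ms."""
+    daemon_before = proc_cpu_ms(daemon_pid) if daemon_pid else 0.0
+    start_ns = time.monotonic_ns()
+    pid = os.posix_spawnp(argv[0], argv, os.environ)
+    _, _, rusage = os.wait4(pid, 0)  # rusage covers the exited child
+    wall_ms = (time.monotonic_ns() - start_ns) / 1e6
+    daemon_ms = (proc_cpu_ms(daemon_pid) - daemon_before) if daemon_pid else 0.0
+    cmd_ms = (rusage.ru_utime + rusage.ru_stime) * 1000.0
+    return {
+        "wall_ms": wall_ms,
+        "cmd_cpu_ms": cmd_ms,
+        "daemon_cpu_ms": daemon_ms,
+        "load_pct": load_percent(cmd_ms + daemon_ms, wall_ms),
+    }
+```
+
+Passing `daemon_pid=None` models the pure-ffmpeg path, where the
+daemon contribution is zero.  `/proc` reports CPU in clock ticks,
+so the daemon figure is only as fine-grained as `SC_CLK_TCK` allows.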
+
+## Results
+
+### 1080p × 150 frames, decode-as-fast-as-possible
+
+| Codec | Path | Wall | CPU time | Load | Cores |
+|-------|------|------|----------|------|-------|
+| **VP9 8-bit**   | pure ffmpeg | 1083 ms | 2327 ms | **214.9 %** | ~2.1 |
+|                 | daedalus    | 1229 ms | 1183 ms |  **96.3 %** | ~1.0 |
+| **AV1 8-bit**   | pure ffmpeg | 1162 ms | 2342 ms | **201.5 %** | ~2.0 |
+|                 | daedalus    | 1478 ms | 1428 ms |  **96.6 %** | ~1.0 |
+| **H.264 8-bit** | pure ffmpeg | 1063 ms | 2188 ms | **205.8 %** | ~2.1 |
+|                 | daedalus    | 1020 ms | 1021 ms | **100.1 %** | ~1.0 |
+| **VP9 10-bit (P010)** | pure ffmpeg |  269 ms | 419 ms | **155.8 %** | ~1.6 |
+|                 | daedalus    |  131 ms | 120 ms |  **91.6 %** | ~0.9 |
+
+## Interpretation
+
+### Daedalus uses ~half the CPU for the same throughput
+
+FFmpeg standalone defaults to `-threads 0` → 2 threads on
+this 4-core Pi 5.  Single-stream decode parallelises poorly
+here: wall time barely changes versus single-threaded
+decode, because frame-level threading (which helps less
+than slice-level threading) spends most of its time waiting
+on reference-frame dependencies.  Result: roughly 2× the
+CPU time for roughly the same throughput.
+
+The daedalus path's daemon is **single-threaded by design**.
+It serialises REQ_DECODE handling through one chardev I/O
+loop, calling FFmpeg's decode with no thread fan-out.  That
+costs us ~one core but no more.
+
+### Wall time is comparable
+
+H.264 is actually faster wallclock through daedalus
+(1020 ms vs 1063 ms).  VP9 and AV1 are about 13 % and 27 %
+slower: the cost of per-frame dmabuf-fd allocation, the
+chardev round-trip, and the daemon's NV12 pack loop.  This
+is the architectural overhead measurement we wanted.
+
+For VP9-10bit the daedalus path is *much* faster wallclock
+(131 ms vs 269 ms) — at small per-frame work the FFmpeg
+thread-pool spin-up overhead dominates.
+
+### At sustained 30fps playback
+
+Both paths fit comfortably inside a 33 ms-per-frame budget
+(largest per-frame wall ≈ 9.9 ms for daedalus AV1).  CPU
+load *during decode bursts*:
+
+| Path | CPU during a decode | CPU averaged over 30fps |
+|------|---------------------|-------------------------|
+| Pure ffmpeg | 200 % (across 2 cores) | ~12 % of system (4-core) |
+| Daedalus    | 100 % (1 core)         | ~6 % of system (4-core)  |
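+
+The arithmetic behind these figures can be checked with a short
+sketch.  It assumes the averaged column spreads each run's CPU
+time over the clip's real-time duration (150 frames at 30 fps,
+i.e. 5 s) and over all four cores; the function names are
+illustrative only.
+
+```python
+FRAMES = 150  # frames per test clip
+FPS = 30      # target playback rate
+CORES = 4     # Pi 5 core count
+
+
+def per_frame_wall_ms(wall_ms, frames=FRAMES):
+    """Average wall time spent decoding one frame."""
+    return wall_ms / frames
+
+
+def system_share_percent(cpu_ms, frames=FRAMES, fps=FPS, cores=CORES):
+    """CPU time as a share of all cores over the clip's playback time."""
+    playback_ms = frames / fps * 1000.0
+    return cpu_ms / playback_ms / cores * 100.0
+
+
+print(round(per_frame_wall_ms(1478), 2))     # daedalus AV1, worst case
+print(round(system_share_percent(2327), 1))  # pure ffmpeg, VP9 8-bit
+print(round(system_share_percent(1183), 1))  # daedalus, VP9 8-bit
+```
+
+With the VP9 8-bit rows from the results table this gives about
+11.6 % for pure ffmpeg and 5.9 % for daedalus, consistent with the
+rounded ~12 % and ~6 % above.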
+
+At the project's user-facing target ("daily YouTube
+playback with CPU free for vscode"; see the
+`30fps-floor-is-fine` memory), daedalus costs roughly half
+the CPU, leaving ~94 % of the system idle versus ~88 %.
+
+### Why isn't this "QPU offloading"?
+
+Phase 8.8 explicitly **didn't** wire QPU dispatch into the
+decode path.  Measurement showed the daemon's FFmpeg
+software path already exceeded the 30fps@1080p target by
+2–3× (the 8-bit daedalus runs above reach 3.4–4.9×), so QPU
+substitution would have been premature optimisation.
+
+The "daedalus pipeline" benefit here comes from
+**architectural choices**, not QPU offload:
+
+1. **Single-threaded decode**: avoids FFmpeg's
+   multi-thread coordination tax on small per-stream
+   workloads.
+2. **Out-of-process daemon**: decoding doesn't share
+   address space with the V4L2 client; CPU spikes don't
+   compete for the client's working set.
+3. **dmabuf zero-copy**: decoded pixels land directly in
+   the client's CAPTURE buffer — no per-frame memcpy at
+   the V4L2 boundary.
+
+The QPU substitution work stays on the roadmap (Phase 8.10+)
+for higher-resolution / higher-fps / lower-power workloads
+where the FFmpeg software path doesn't have enough headroom.
+
+## Caveats
+
+- Tests are decode-as-fast-as-possible, not real-time
+  playback.  Real consumers would pace to the display
+  refresh rate; both paths then idle between frames, so
+  absolute load falls while the ~2× ratio should hold.
+- Daemon's `daemon_cpu_ms` includes its own FFmpeg
+  decode plus chardev I/O plus NV12 pack — they're
+  not separable from `/proc/PID/stat`.
+- 4-core Pi 5; results scale differently on 1- or
+  2-core machines (the multi-thread overhead becomes
+  proportionally worse).
+- All tests use the same `testsrc` 1080p source; real
+  content (high-motion, large GOPs) would shift the
+  per-frame µs but the architectural ratio should hold.
+
+## Reproduce
+
+```sh
+# On hertz, with the kernel module loaded and the daemon running:
+python3 /tmp/cpu_bench.py
+```
+
+`/tmp/cpu_bench.py` (the harness used here) reads
+`/proc/<daemon-pid>/stat` for the daemon-side CPU and uses
+`os.wait4` rusage for the client-side CPU.  Inputs:
+`/tmp/{vp9,av1}_5s.ivf` and `/tmp/h264_5s.h264` (1080p 30fps,
+150 frames each), generated in `/tmp` by:
+
+```sh
+ffmpeg -f lavfi -i 'testsrc=duration=5:size=1920x1080:rate=30' \
+       -pix_fmt yuv420p -c:v libvpx-vp9 -cpu-used 8 -y vp9_5s.ivf
+ffmpeg -f lavfi -i 'testsrc=duration=5:size=1920x1080:rate=30' \
+       -pix_fmt yuv420p -c:v libsvtav1 -preset 12 -y av1_5s.ivf
+ffmpeg -f lavfi -i 'testsrc=duration=5:size=1920x1080:rate=30' \
+       -pix_fmt yuv420p -c:v libx264 -preset ultrafast \
+       -profile:v baseline -y h264_5s.h264
+```