dmabuf-modifier-triage/phase2_iter1_findings.md
marfrit 54fb20bcc0 iter1 phase 4: H7 confirmed by panfrost kernel source-read
Read Linux 6.12 panfrost source. Smoking gun chain:

(a) panfrost_gem.c:262 — obj->base.map_wc = !pfdev->coherent
    Imports get write-combine (uncached) CPU mapping if non-coherent.

(b) panfrost_drv.c:625 — pfdev->coherent comes from DT dma-coherent.
    On ohm: NO dma-coherent in /sys/firmware/devicetree/base/.
    So pfdev->coherent = false.

(c) panfrost_mmu.c:330 — int prot = IOMMU_READ | IOMMU_WRITE
    NO IOMMU_CACHE. Mali's IOMMU mapping is non-snooping.
    Mali reads directly from DRAM, bypassing CPU caches.

(d) KWin caches EGL_images per-fd (eglbackend.cpp:282).
    Cache sync only at dma_buf_map_attachment time (one-time).

(e) V4L2 doesn't attach dma_resv fences to CAPTURE buffers on DQBUF.
    No per-frame cache flush trigger.

Architectural picture: hantro writes through CPU L1/L2/L3 caches,
Mali reads through non-snooping IOMMU, sees stale/zero DRAM. Result:
green frames.

Counter-validation: --vo=gpu (CPU-mmap → glTexSubImage2D upload to
Mali-private BO) works correctly. CPU mmap triggers begin_cpu_access
sync. Mali-private BO is naturally cache-coherent. Per-frame implicit
sync via the CPU mmap path.

Why DMA_BUF_IOCTL_SYNC (phase 3) didn't help: that's CPU-side cache
management. Doesn't propagate to GPU IOMMU.

ROOT CAUSE: Same root cause as our upstream vb2_dma_resv RFC v2.
With V4L2 attaching dma_resv fence on DQBUF, mesa-panfrost's implicit
fence-wait at sample time enforces cache writeback. RFC v2 IS the fix.

Proposed iter2 path: build linux-pinetab2-danctnix-besser 7.0 with
RFC v2 patches applied, install on ohm, retest. ~2-3 hours total.

If green goes away, we have:
  - confirmation that our RFC fixes a shipping-product bug
  - a locally shippable kernel package
  - strong data point for the v2 cover letter

Posted to dmabuf-modifier-triage#1 comment 260.
2026-05-08 22:40:51 +00:00

# Phase 2 — iter1 source-read findings (REOPEN of root-cause analysis)
**Opened 2026-05-08** during the iter1 phase 2 source-read of mpv 0.41.0 + Kwiboo's ffmpeg fork at commit `b57fbbe`. Phase 0's earlier conclusion ("mpv mixes per-plane fds with single-allocation offset") needs revision: the source read and the runtime probe together show that the situation is more nuanced than the WAYLAND_DEBUG wire trace alone suggested.
## What the source actually says
**mpv `video/out/vo_dmabuf_wayland.c` `drmprime_dmabuf_importer` (lines 250-277)** straightforwardly relays the producer's `AVDRMFrameDescriptor`:
```c
for (plane_no = 0; plane_no < layer.nb_planes; ++plane_no) {
    AVDRMPlaneDescriptor plane = layer.planes[plane_no];
    int object_index = plane.object_index;
    AVDRMObjectDescriptor object = desc->objects[object_index];
    uint64_t modifier = object.format_modifier;
    zwp_linux_buffer_params_v1_add(params, object.fd, plane_no, plane.offset,
                                   plane.pitch, modifier >> 32, modifier & 0xffffffff);
}
```
No `dup()`, no rewriting, no transformation. mpv passes through what `AVDRMFrameDescriptor` says.
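The `modifier >> 32, modifier & 0xffffffff` arguments above split the 64-bit DRM format modifier into the two 32-bit halves that the protocol request carries. A minimal sketch of that split and its inverse follows. `split_modifier`, `join_modifier`, and `check_roundtrip` are hypothetical helper names for illustration, not mpv or libwayland API.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit halves carried by zwp_linux_buffer_params_v1.add. */
struct modifier_halves {
    uint32_t hi; /* upper 32 bits of the 64-bit modifier */
    uint32_t lo; /* lower 32 bits of the 64-bit modifier */
};

/* Split a 64-bit DRM format modifier the same way the mpv loop does. */
static struct modifier_halves split_modifier(uint64_t modifier)
{
    struct modifier_halves h;
    h.hi = (uint32_t)(modifier >> 32);
    h.lo = (uint32_t)(modifier & 0xffffffffu);
    return h;
}

/* Reassemble the modifier, as a receiver of the request would. */
static uint64_t join_modifier(struct modifier_halves h)
{
    return ((uint64_t)h.hi << 32) | h.lo;
}

/* Debug helper: aborts if splitting and rejoining changes the value. */
static void check_roundtrip(uint64_t modifier)
{
    assert(join_modifier(split_modifier(modifier)) == modifier);
}
```

For the LINEAR modifier (0x0) both halves are 0, which is consistent with the trailing `0, 0` in the `.add()` lines of the wire trace.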
**Kwiboo's `libavutil/hwcontext_v4l2request.c` `v4l2request_set_drm_descriptor` (lines 138-198)** for hantro's NV12 single-planar (V4L2_PIX_FMT_NV12, the format `v4l2-ctl --get-fmt-video-mplane-cap` reports for `/dev/video1` on ohm):
```c
desc->base.nb_objects = num_planes; // = 1 for single-planar NV12 on hantro
desc->base.objects[0].fd = exportbuffer.fd; // VIDIOC_EXPBUF returns ONE fd
// in v4l2request_set_drm_descriptor:
desc->nb_layers = 1;
layer->nb_planes = 1;
layer->planes[0].object_index = 0;
layer->planes[0].offset = 0;
layer->planes[0].pitch = bytesperline; // 1920
if (modifier != ARM_VENDOR) { // hantro outputs LINEAR (0x0), so this is true
    layer->nb_planes = 2;
    layer->planes[1].object_index = 0; // ← BOTH PLANES point at object 0
    layer->planes[1].offset = pitch * height; // 1920 * 1088 = 2088960
    layer->planes[1].pitch = layer->planes[0].pitch;
}
```
Per the source, mpv should produce **identical** fd values in the two `.add()` calls — both pulling from `desc->objects[0].fd`.
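The layout that the descriptor code derives can be reproduced with plain arithmetic. The sketch below assumes a linear, single-allocation NV12 buffer in which the UV plane starts immediately after `pitch * height` bytes of Y data. `nv12_linear_layout` and the struct names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

struct plane_layout {
    uint32_t offset; /* byte offset from the start of the allocation */
    uint32_t pitch;  /* bytes per row */
};

struct nv12_layout {
    struct plane_layout y;
    struct plane_layout uv;
    uint32_t min_size; /* bytes needed for Y plus interleaved UV, no padding */
};

/* Compute the two-plane view of a single linear NV12 allocation. */
static struct nv12_layout nv12_linear_layout(uint32_t pitch, uint32_t height)
{
    struct nv12_layout l;

    assert(height % 2 == 0); /* 4:2:0 chroma has height / 2 rows */
    assert((uint64_t)pitch * height * 3 / 2 <= UINT32_MAX);

    l.y.offset = 0;
    l.y.pitch = pitch;
    l.uv.offset = pitch * height; /* UV follows the full Y plane */
    l.uv.pitch = pitch;           /* interleaved CbCr rows share the Y pitch */
    l.min_size = l.uv.offset + pitch * (height / 2);
    return l;
}
```

With `pitch = 1920` and `height = 1088` this gives a UV offset of 2088960, the same value the quoted ffmpeg code assigns to `planes[1].offset`.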
## What the runtime probe says
`v4l2-ctl --get-fmt-video-mplane-cap` on ohm `/dev/video1`:
```
Pixel Format : 'NV12' (Y/UV 4:2:0)
Number of planes : 1
sizeimage=3655712, bytesperline=1920
```
`strace -e trace=ioctl mpv ...` confirms ffmpeg only does **one** `VIDIOC_EXPBUF` per CAPTURE buffer (`index=N, plane=0` → one fd), exactly matching `nb_objects = 1`.
But `WAYLAND_DEBUG=1 mpv ...` shows two `.add()` calls **with different fd numbers** per buffer:
```
add(fd 41, 0, 0, 1920, 0, 0)
add(fd 42, 1, 2088960, 1920, 0, 0)
```
These fd numbers are **consecutive**, suggesting libwayland's `wl_closure_marshal` is `dup_cloexec`'ing the fd at protocol-marshal time and the trace prints the post-dup fd. Both fd 41 and fd 42 are dups of the same underlying `dma_buf` object (originally fd 17 or similar in mpv's table).
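One way to test the "same underlying object" reading from inside a process is to compare what `fstat()` reports for the two descriptors. This is a sketch under an assumption: it relies on the kernel giving each `dma_buf` its own inode, which should be checked for the kernel in use (`kcmp(2)` with `KCMP_FILE` would be a stricter alternative). `fds_share_inode` and `dup_refers_to_same_file` are hypothetical names.

```c
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both fds report the same device and inode,
 * 0 if they differ, -1 if fstat() fails on either fd. */
static int fds_share_inode(int fd_a, int fd_b)
{
    struct stat a, b;

    assert(fd_a >= 0 && fd_b >= 0);
    if (fstat(fd_a, &a) != 0 || fstat(fd_b, &b) != 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/* Duplicates fd (as a marshalling layer might) and checks that the
 * copy still reports the same inode. Returns 1, 0, or -1 as above. */
static int dup_refers_to_same_file(int fd)
{
    int copy = dup(fd);
    int same;

    if (copy < 0)
        return -1;
    same = fds_share_inode(fd, copy);
    close(copy);
    return same;
}
```

A result of 1 for two received fds would support the dup explanation; a result of 0 would mean the two planes really do come from different objects.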
## Implications for iter1
The earlier phase 0 conclusion that mpv constructs an "internally inconsistent" wl_dmabuf message was **wrong**. There is no inconsistency at the producer ↔ mpv layer:
- nb_objects = 1, both planes use object 0 → mpv passes the same fd value into both `.add()` calls
- libwayland dups it before sending → wire trace shows different fd numbers, but they refer to the same backing memory
- Plane 1's offset = 2088960 is correct relative to the (single) underlying allocation
So the green frame is **not caused by mpv or ffmpeg's descriptor construction**. Something else is responsible.
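The claim that plane 1's offset is valid for the single allocation can be checked against the `sizeimage` reported by `v4l2-ctl`. The sketch below is a plain bounds check, plus a round-up helper for exporters that size the buffer in whole pages. The 4096-byte page size and the names `round_up` and `plane_fits` are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Round value up to the next multiple of align (align must be a power of two). */
static uint64_t round_up(uint64_t value, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* Returns 1 if [offset, offset + pitch * rows) lies inside a buffer of
 * buf_size bytes, 0 otherwise. */
static int plane_fits(uint64_t offset, uint64_t pitch, uint64_t rows,
                      uint64_t buf_size)
{
    uint64_t len = pitch * rows;

    return offset <= buf_size && len <= buf_size - offset;
}
```

For the UV plane here, `plane_fits(2088960, 1920, 544, 3655712)` returns 1: the plane ends at byte 3133440, well inside the reported `sizeimage`.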
## New hypothesis space (one of these is the real bug)
1. **Mali-G52 panfrost EGL_EXT_image_dma_buf_import_modifiers regression for NV12 with non-zero plane offset.** **Source-read 2026-05-08 of Mesa 26.0.6 makes this LESS LIKELY.** Trace at `references/mesa-26.0.6/`:
- `src/gallium/drivers/panfrost/pan_screen.c:443` reports `external_only[count] = is_yuv` for any YUV format → NV12+LINEAR is external_only, forcing KWin's per-plane import path (Y as R8, UV as DRM_FORMAT_GR88).
- `src/loader/loader_dri_helper.c:43` maps `DRM_FORMAT_GR88 ↔ PIPE_FORMAT_RG88_UNORM` (the byte-order distinction is preserved at the pipe-format level — `.r` = byte 0 = Cb, `.g` = byte 1 = Cr — matching KWin's shader assumption at glshadermanager.cpp:189 `result.yz = sampler1.rg`).
- `src/gallium/drivers/panfrost/pan_resource.c:354-358` captures `whandle->offset` into `explicit_layout.offset_B` for the import.
- `src/panfrost/lib/pan_mod.c:663-667` (linear modifier slice-init) honors `layout_constraints->offset_B` directly with only an alignment check; 2,088,960 is page-aligned, satisfies 16-/64-/4096-byte alignments alike.
- `src/panfrost/lib/pan_texture.c:361,561,660,773,817` set the texture descriptor's GPU base to `plane->base + slayout->offset_B` — i.e., sampling reads from `bo_gpu + 2,088,960`.
- **Conclusion**: the panfrost source as written DOES honor a non-zero plane offset. A source-read alone cannot rule out a runtime bug, but the obvious places are clean. To definitively rule this in or out, write the EGL importer harness with synthetic NV12 data.
2. ~~**KWin's wl_dmabuf import logic deduplicates the dup'd fds incorrectly.**~~ **RULED OUT 2026-05-08** by source-read of KWin 6.6.4 at `references/kwin-6.6.4/src/wayland/linuxdmabufv1clientbuffer.cpp` + `src/opengl/{eglbackend,egldisplay}.cpp`. (a) `LinuxDmaBufParamsV1::zwp_linux_buffer_params_v1_add` simply stores per-plane fd/offset/pitch in separate slots, no dedup. (b) `LinuxDmaBufParamsV1::test()` does `lseek(SEEK_END)` per plane + range checks against the resulting size; our 3,657,728 satisfies all of them. (c) `EglDisplay::importDmaBufAsImage` (both the combined and per-plane forms) passes `dmabuf.fd[i]`, `dmabuf.offset[i]`, `dmabuf.pitch[i]` straight to `eglCreateImage(EGL_LINUX_DMA_BUF_EXT, ...)` with no transformation. (d) `EglBackend::testImportBuffer` chooses between combined import and per-plane (Y as R8 / UV as RG88 from offset 2,088,960) based on whether NV12+LINEAR is in `nonExternalOnlySupportedDrmFormats()`. **Either path** forwards `offset = 2,088,960` to the driver. KWin is innocent.
3. ~~**hantro kernel driver exports a `dma_buf` with `size` < full allocation.**~~ **RULED OUT 2026-05-08** by `/tmp/expbuf_probe.c` on ohm. Driver `hantro-vpu` on `rk3568-vpu-dec` reports `CAPTURE: NV12 1920x1088 num_planes=1 sizeimage=3655712`; `VIDIOC_EXPBUF` yields fd whose `lseek(fd, 0, SEEK_END) = 3,657,728` (page-rounded up from 3,655,712). Offset 2,088,960 (plane 1 base) is firmly inside the exported size. Kernel is innocent.
Side observation worth recording: `sizeimage = 3,655,712` is bigger than naïve NV12's 1920×1088×1.5 = 3,133,440. The 522,272-byte excess sits **past** the UV plane (Y at [0, 2,088,960), UV at [2,088,960, 3,133,440), trailing padding at [3,133,440, 3,655,712)). On Rockchip codecs that tail commonly holds per-frame motion-vector / decoder-context data. Confirms ffmpeg's hardcoded `planes[1].offset = pitch*height = 2,088,960` is correct.
4. **kwin-fourier 0001 still has an effect we missed.** Even though we ruled out kwin-fourier as a compositor-replacement A/B, that test ran on an earlier kernel/Mesa combo. Worth verifying that the test environment is fully reset.
6. ~~**DMA cache coherency between hantro VPU and Mali GPU** (NEW 2026-05-08, derived from green-color math).~~ **RULED OUT 2026-05-08 phase 3** by the iter1 patch test. Patched mpv 0.41.0 (mpv-fourier-1:0.41.0-9) to call `DMA_BUF_IOCTL_SYNC(SYNC_START|SYNC_RW)` + matching `SYNC_END` on each EXPBUF fd in both `vaapi_dmabuf_importer` and `drmprime_dmabuf_importer` before `zwp_linux_buffer_params_v1_add()`. Built via Gitea Actions, installed on ohm. Ran `mpv --hwdec=v4l2request --vo=dmabuf-wayland --fullscreen --pause --start=00:00:00.42 fourier-test/bbb_1080p30_h264.mp4`, captured screenshot. **Result: byte-identical to baseline (md5 c8c8e9b88521a0069f709d483451c3d4).** The userspace cache-sync ioctl has no effect. Either hantro's `dma_buf_ops->begin_cpu_access` is a no-op (likely on Rockchip — many dma-buf-heap allocations are non-coherent CPU-cached but rely on different sync paths), OR the gap is on the GPU consumer side and CPU-cache state is irrelevant.
**Critical phase-3 observation**: `--hwdec=v4l2request --vo=gpu` (texture-upload path) is known-working and renders correctly. That path mmap's the dma_buf into mpv's CPU memory, then uploads to a GL texture via `glTexSubImage2D`. **The CPU CAN read valid YUV data from the buffer**; only the zero-copy dma_buf-to-GPU import path renders zeros. This rules out "data isn't there" entirely and concentrates the hypothesis on the dma_buf → Mali GPU import/translation step itself (H7 territory).
7. **Panfrost dma_buf import doesn't perform GPU-side cache invalidation OR doesn't import with the right BO type.** **CONFIRMED VIA SOURCE-READ 2026-05-08 phase 4** (Linux 6.12 panfrost source at `~/src/linux-rfc/drivers/gpu/drm/panfrost/`):
- `panfrost_gem_create_object` (panfrost_gem.c:262): `obj->base.map_wc = !pfdev->coherent;` — sets write-combine (uncached) CPU mapping when device isn't coherent. Applies to imports too via `drm_gem_shmem_prime_import_sg_table`.
- `pfdev->coherent` (panfrost_drv.c:625): `device_get_dma_attr(&pdev->dev) == DEV_DMA_COHERENT` — i.e., from the DT `dma-coherent` property on the panfrost node.
- On ohm (RK3566 PineTab2 besser-7.0): **NO `dma-coherent` property** anywhere in `/sys/firmware/devicetree/base/` (verified via `find ... -name dma-coherent`). So `pfdev->coherent = false`.
- `panfrost_mmu_map` (panfrost_mmu.c:330): **`int prot = IOMMU_READ | IOMMU_WRITE;`** — **no `IOMMU_CACHE`**. Imported BOs get mapped into Mali's IOMMU as non-snooping. Mali reads directly from DRAM.
- The only cache sync that occurs is **once** at `dma_buf_map_attachment` time (during EGL_image import in KWin). KWin caches the EGL_image per-fd in `m_importedBuffers` (eglbackend.cpp:282), reusing it for every subsequent frame.
- **No per-frame cache sync mechanism exists** — V4L2 doesn't attach `dma_resv` fences to CAPTURE buffers on DQBUF (the exact gap addressed by our `vb2_dma_resv` RFC v2 upstream).
**Architectural picture**: hantro VPU writes decoded YUV to its CMA buffer through CPU L1/L2/L3 caches. Mali GPU reads through its IOMMU with no cache snoop. Without per-frame fence-driven cache flush, Mali sees DRAM-direct content — which lags behind hantro's writes (often zero-fill of fresh-allocated pages). **Result: green frames.**
**Counter-validation**: `mpv --hwdec=v4l2request --vo=gpu` (CPU-mmap of dma_buf → glTexSubImage2D upload to Mali-private BO) works correctly. CPU mmap triggers cache sync via dma-buf's `begin_cpu_access`. The Mali-private destination BO is normally cached/coherent because Mali allocated it. The CPU mmap path therefore provides an implicit per-frame cache sync. This confirms that the buffer DOES contain valid data and that only the zero-copy dma_buf-to-Mali-IOMMU path lacks per-frame sync.
**Why `DMA_BUF_IOCTL_SYNC` (phase 3) didn't help**: that ioctl invokes `dma_buf_ops->begin_cpu_access` / `end_cpu_access` — both **CPU**-side cache management. They don't propagate to the GPU's IOMMU mapping. The GPU still reads through its non-snooping mapping; CPU cache state is irrelevant to it.
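For reference, the start/end bracket that the phase 3 patch placed around each fd looks roughly like the sketch below. It uses the uapi from `<linux/dma-buf.h>`. `dmabuf_cpu_access` is a hypothetical wrapper name and this is not the actual mpv-fourier patch.

```c
#include <assert.h>
#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Issue DMA_BUF_IOCTL_SYNC on a dma_buf fd.
 * begin != 0 requests SYNC_START | SYNC_RW, begin == 0 requests
 * SYNC_END | SYNC_RW. Returns the ioctl() result: 0 on success,
 * -1 with errno set on failure (for example on a non-dma_buf fd). */
static int dmabuf_cpu_access(int fd, int begin)
{
    struct dma_buf_sync sync = { 0 };

    assert(fd >= 0);
    sync.flags = (begin ? DMA_BUF_SYNC_START : DMA_BUF_SYNC_END)
               | DMA_BUF_SYNC_RW;
    return ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
}
```

A caller would invoke `dmabuf_cpu_access(fd, 1)` before touching the buffer from the CPU and `dmabuf_cpu_access(fd, 0)` afterwards. Nothing in this call names the GPU or its mapping, which fits the observation that it left the rendered output unchanged.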
**Real fix path**: kernel-side V4L2 `vb2_dma_resv` patches (our upstream RFC v2). With V4L2 attaching a `dma_resv` fence on DQBUF for CAPTURE, mesa-panfrost's implicit fence-wait at sample time will block until hantro's writes signal — and the fence signaling semantics imply cache writeback. The fence-wait + cache-flush combination resolves the green frames.
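A userspace analogue of an implicit fence-wait is `poll()` on the dma_buf fd: the kernel's dma-buf documentation describes `POLLIN` as reflecting the state of the most recent write (exclusive) fence. The sketch below is a diagnostic idea only, not part of the RFC. `wait_for_write_fence` is a hypothetical name, and an immediate return does not by itself show whether a fence was ever attached, since a fence that has already signaled also returns at once.

```c
#include <assert.h>
#include <poll.h>

/* Wait until the fd reports POLLIN or the timeout expires.
 * On a dma_buf fd, POLLIN is expected to mean that no write fence
 * is pending. Returns 1 if POLLIN was reported, 0 on timeout,
 * -1 on error or if the fd is invalid. */
static int wait_for_write_fence(int dmabuf_fd, int timeout_ms)
{
    struct pollfd pfd;
    int ret;

    assert(dmabuf_fd >= 0);
    pfd.fd = dmabuf_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    ret = poll(&pfd, 1, timeout_ms);
    if (ret < 0)
        return -1;
    if (ret == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Timing this call on a freshly dequeued CAPTURE buffer, before and after the kernel patches, could serve as one extra data point alongside the screenshot comparison.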
## Recommended next moves for iter1
a. **Write a small C harness that does VIDIOC_EXPBUF on a hantro CAPTURE buffer and reports fd size + backing dma_buf info.** Decides hypothesis 3 in 30 minutes. Run on ohm directly. **Done 2026-05-08**: `/tmp/expbuf_probe.c` ruled out hypothesis 3.
b. **Patch mpv with `MP_VERBOSE` logging of the AVDRMFrameDescriptor fields at .add()-call time** (nb_objects, planes[].object_index, planes[].offset, objects[].size). Confirms the source-read is correct at runtime. Drop into mpv-fourier's `prepare()` slot, bump pkgrel, rebuild on fermi (~10 min CI).
c. **Read KWin's wl_dmabuf import logic** (KDE Plasma 6 / KWin 6.6.4 source) for how it handles multiple-fd-same-buffer cases. ~30 min source-read. **Done 2026-05-08**: the source-read ruled out hypothesis 2.
d. **Update `marfrit/dmabuf-modifier-triage#1`** with this revised analysis. The current issue body claims the bug is in mpv's plane-semantics translation — that conclusion is now overturned.
## Status
- iter1 phase 4 closed 2026-05-08. **H7 confirmed via panfrost kernel source-read.** The dmabuf-wayland green-frame bug is structurally caused by the *missing per-frame cache sync mechanism* between hantro VPU and Mali GPU, on a non-coherent SoC (RK3566), with KWin caching the EGL_image per-fd. V4L2 doesn't attach `dma_resv` fences to CAPTURE buffers on DQBUF, so panfrost has no per-frame fence to wait on, and never flushes cross-device cache between frames.
- Six hypotheses ruled out (H1, H2, H3, H5, H6, ad-hoc offset variant). H4 latent. **H7 leading and root-cause-confirmed**.
- Acceptance criterion (`screenshots/frame10_expected.png`) is unchanged.
- **Critical discovery for the campaign**: the dmabuf-wayland green is **the same root cause** as the upstream RFC we're already advancing. The `vb2_dma_resv` v2 patches we're preparing for the linux-media list ARE the fix for ohm.
### Proposed iter2 / phase 5 path
Take the kernel rebuild route. Build `linux-pinetab2-danctnix-besser` 7.0 with `vb2_dma_resv` RFC v2 patches applied, install on ohm, retest. If green goes away, we have:
1. Confirmation that our upstream RFC fixes a real shipping-product bug
2. A locally-shippable fix via `linux-pinetab2-fourier` (or similar fresnel-style kernel package)
3. A strong concrete data point to include in the v2 cover letter
Estimated effort:
- Apply RFC v2 patches to besser-7.0 source: ~30 min (patches need rebasing; current upstream RFC is against 6.12-rc, besser is 7.0)
- Build kernel via distcc (his can wire up DISTCC_HOSTS): ~45-90 min
- Install + reboot + retest: ~15 min
- Total: ~2-3 hours
### Alternative paths if kernel-rebuild blocks
a. **EGL importer harness with synthetic NV12 in udmabuf** (~1-2h): would CONFIRM by independent test that the issue is producer-cache-flush specifically (synthetic NV12 in CPU-allocated udmabuf would have writes via CPU mmap → naturally flushed; should render correctly even with current panfrost). Worth doing as additional evidence.
b. **mpv-fourier `--vo=dmabuf-wayland` workaround**: re-import the dma_buf each frame from mpv-side. Defeats zero-copy. Not desirable. Only viable as last-resort fallback.
c. **kwin-fourier workaround**: invalidate the cached EGL_image per-frame. Same downside (zero-copy defeated). But would help validate the kernel theory.
mpv-fourier-1:0.41.0-9 keeps the harmless no-op DMA_BUF_IOCTL_SYNC patch installed for now. If the kernel-rebuild path works, the patch will be reverted in the next mpv-fourier rev.