diff --git a/phase2_iter1_findings.md b/phase2_iter1_findings.md index b1f39ef..5a9d624 100644 --- a/phase2_iter1_findings.md +++ b/phase2_iter1_findings.md @@ -93,11 +93,21 @@ So the green frame is **not caused by mpv or ffmpeg's descriptor construction**. **Critical phase-3 observation**: `--hwdec=v4l2request --vo=gpu` (texture-upload path) is known-working and renders correctly. That path mmap's the dma_buf into mpv's CPU memory, then uploads to a GL texture via `glTexSubImage2D`. **The CPU CAN read valid YUV data from the buffer**; only the zero-copy dma_buf-to-GPU import path renders zeros. This rules out "data isn't there" entirely and concentrates the hypothesis on the dma_buf → Mali GPU import/translation step itself (H7 territory). -7. **Panfrost dma_buf import doesn't perform GPU-side cache invalidation OR doesn't import with the right BO type** when mapping an imported fd. Even if data has reached DRAM, Mali's MMU/cache may serve stale reads, OR the imported BO is created without the `DMA-COHERENT` flag panfrost expects, leaving Mali sampling un-snooped memory. Phase 3 narrowed: this is now the LEADING hypothesis. Sub-cases to investigate: - - 7a. Panfrost's `dma_buf_attach` + `dma_buf_map_attachment` calls miss cache-invalidate. Source is `drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c` and `panfrost_drv.c`'s `panfrost_gem_prime_import_sg_table`. - - 7b. The imported BO is mapped non-coherent in Mali's MMU, but the buffer was allocated cacheable (or vice-versa). Sync mismatch. - - 7c. Panfrost uses ioremap_wc / ioremap_cache the wrong way for hantro-allocated CMA pages. - - 7d. The Mali-G52 panfrost path does NOT support imported dma_buf for sampling at all — only for scanout direct-pass-through. (Less likely; would mean GL texture creation should fail, but in our case it succeeds and renders zeros.) +7. 
**Panfrost dma_buf import doesn't perform GPU-side cache invalidation OR doesn't import with the right BO type.** ✅ **CONFIRMED VIA SOURCE-READ 2026-05-08 phase 4** (Linux 6.12 panfrost source at `~/src/linux-rfc/drivers/gpu/drm/panfrost/`): + - `panfrost_gem_create_object` (panfrost_gem.c:262): `obj->base.map_wc = !pfdev->coherent;` — sets write-combine (uncached) CPU mapping when device isn't coherent. Applies to imports too via `drm_gem_shmem_prime_import_sg_table`. + - `pfdev->coherent` (panfrost_drv.c:625): `device_get_dma_attr(&pdev->dev) == DEV_DMA_COHERENT` — i.e., from the DT `dma-coherent` property on the panfrost node. + - On ohm (RK3566 PineTab2 besser-7.0): **NO `dma-coherent` property** anywhere in `/sys/firmware/devicetree/base/` (verified via `find ... -name dma-coherent`). So `pfdev->coherent = false`. + - `panfrost_mmu_map` (panfrost_mmu.c:330): **`int prot = IOMMU_READ | IOMMU_WRITE;`** — **no `IOMMU_CACHE`**. Imported BOs get mapped into Mali's IOMMU as non-snooping. Mali reads directly from DRAM. + - The only cache sync that occurs is **once** at `dma_buf_map_attachment` time (during EGL_image import in KWin). KWin caches the EGL_image per-fd in `m_importedBuffers` (eglbackend.cpp:282), reusing it for every subsequent frame. + - **No per-frame cache sync mechanism exists** — V4L2 doesn't attach `dma_resv` fences to CAPTURE buffers on DQBUF (the exact gap addressed by our `vb2_dma_resv` RFC v2 upstream). + + **Architectural picture**: hantro VPU writes decoded YUV to its CMA buffer through CPU L1/L2/L3 caches. Mali GPU reads through its IOMMU with no cache snoop. Without per-frame fence-driven cache flush, Mali sees DRAM-direct content — which lags behind hantro's writes (often zero-fill of fresh-allocated pages). **Result: green frames.** + + **Counter-validation**: `mpv --hwdec=v4l2request --vo=gpu` (CPU-mmap of dma_buf → glTexSubImage2D upload to Mali-private BO) works correctly. CPU mmap triggers cache sync via dma-buf's `begin_cpu_access`. 
The Mali-private destination BO is normally cached/coherent because Mali allocated it. Per-frame implicit cache sync via the CPU mmap path. Confirms the buffer DOES contain valid data and only the zero-copy dma_buf-to-Mali-IOMMU path lacks per-frame sync. + + **Why `DMA_BUF_IOCTL_SYNC` (phase 3) didn't help**: that ioctl invokes `dma_buf_ops->begin_cpu_access` / `end_cpu_access` — both **CPU**-side cache management. They don't propagate to the GPU's IOMMU mapping. The GPU still reads through its non-snooping mapping; CPU cache state is irrelevant to it. + + **Real fix path**: kernel-side V4L2 `vb2_dma_resv` patches (our upstream RFC v2). With V4L2 attaching a `dma_resv` fence on DQBUF for CAPTURE, mesa-panfrost's implicit fence-wait at sample time will block until hantro's writes signal — and the fence signaling semantics imply cache writeback. The fence-wait + cache-flush combination resolves the green frames. ## Recommended next moves for iter1 @@ -111,14 +121,28 @@ d. **Update `marfrit/dmabuf-modifier-triage#1`** with this revised analysis. The ## Status -- iter1 phase 3 closed 2026-05-08. The DMA_BUF_IOCTL_SYNC patch (mpv-fourier-1:0.41.0-9, both vaapi_dmabuf_importer + drmprime_dmabuf_importer) had **zero effect** — green-frame screenshot byte-identical to baseline. **H6 ruled out.** -- Five hypotheses ruled out (H2, H3, H5, H6, the ad-hoc offset variant). H1 less-likely after Mesa source-read but not conclusively excluded. -- **Leading hypothesis: H7** — panfrost's dma_buf import / GPU-side cache or BO-type handling. Pinned by the *known-good* counter-test: `mpv --hwdec=v4l2request --vo=gpu` (CPU-mmap → glTexSubImage2D upload path) renders correctly. So the buffer DOES contain valid data; only the zero-copy dma_buf→Mali path renders zeros. +- iter1 phase 4 closed 2026-05-08. 
**H7 confirmed via panfrost kernel source-read.** The dmabuf-wayland green-frame bug is structurally caused by the *missing per-frame cache sync mechanism* between hantro VPU and Mali GPU, on a non-coherent SoC (RK3566), with KWin caching the EGL_image per-fd. V4L2 doesn't attach `dma_resv` fences to CAPTURE buffers on DQBUF, so panfrost has no per-frame fence to wait on, and never flushes cross-device cache between frames. +- Six hypotheses ruled out (H1, H2, H3, H5, H6, ad-hoc offset variant). H4 latent. **H7 leading and root-cause-confirmed**. - Acceptance criterion (`screenshots/frame10_expected.png`) is unchanged. -- Delivery vehicle re-evaluation again: with H6 gone, the userspace mpv workaround is no longer the right delivery vehicle for this iteration. The fix lands in: (a) panfrost kernel-mode driver (`drivers/gpu/drm/panfrost/`), (b) Mesa-panfrost userspace if there's an EGL_image attribute / format-import quirk, (c) hantro driver-side allocation flags (V4L2_MEMORY_DMABUF + appropriate cache attribute), or (d) a kernel bridge (e.g., DMA_BUF_IOCTL_SET_NAME with cache-aware variant). -- Next probe options ranked: - 1. **Read panfrost kernel-mode dma_buf import** (~45 min, cheap source-read, no hardware): inspect `drivers/gpu/drm/panfrost/panfrost_gem.c` `panfrost_gem_prime_import_sg_table` and Mali MMU mapping for cache attributes / IO-coherency settings. May spot the gap directly. - 2. **EGL importer harness with synthetic NV12 in udmabuf** (~1-2h): allocate via udmabuf (CPU-coherent), write known YUV pattern, eglCreateImage from the udmabuf, render, glReadPixels. If it reads back correct data → bug is hantro-allocated-buffer-specific (cache-attribute mismatch). If it ALSO reads zeros → general panfrost dma_buf import bug (less likely). - 3. **Run mpv with `MESA_DEBUG=verbose` + `PAN_MESA_DEBUG=sync,trace`** (~15 min): may show something at the import boundary. Cheap recon. 
+- **Critical discovery for the campaign**: the dmabuf-wayland green is **the same root cause** as the upstream RFC we're already advancing. The `vb2_dma_resv` v2 patches we're preparing for the linux-media list ARE the fix for ohm. -mpv-fourier-1:0.41.0-9 keeps the no-op patch installed for now — it's harmless. Future iter close (iter2 or further phase under iter1) will replace it with whatever the actual fix is, OR pkgrel-bump back to remove the dead patch if the fix lands elsewhere (kernel/Mesa). +### Proposed iter2 / phase 5 path + +Take the kernel rebuild route. Build `linux-pinetab2-danctnix-besser` 7.0 with `vb2_dma_resv` RFC v2 patches applied, install on ohm, retest. If green goes away, we have: +1. Confirmation that our upstream RFC fixes a real shipping-product bug +2. A locally-shippable fix via `linux-pinetab2-fourier` (or similar fresnel-style kernel package) +3. A strong concrete data point to include in the v2 cover letter + +Estimated effort: +- Apply RFC v2 patches to besser-7.0 source: ~30 min (patches need rebasing; current upstream RFC is against 6.12-rc, besser is 7.0) +- Build kernel via distcc (his can wire up DISTCC_HOSTS): ~45-90 min +- Install + reboot + retest: ~15 min +- Total: ~2-3 hours + +### Alternative paths if kernel-rebuild blocks + +a. **EGL importer harness with synthetic NV12 in udmabuf** (~1-2h): would CONFIRM by independent test that the issue is producer-cache-flush specifically (synthetic NV12 in CPU-allocated udmabuf would have writes via CPU mmap → naturally flushed; should render correctly even with current panfrost). Worth doing as additional evidence. +b. **mpv-fourier `--vo=dmabuf-wayland` workaround**: re-import the dma_buf each frame from mpv-side. Defeats zero-copy. Not desirable. Only viable as last-resort fallback. +c. **kwin-fourier workaround**: invalidate the cached EGL_image per-frame. Same downside (zero-copy defeated). But would help validate the kernel theory. 
+ +mpv-fourier-1:0.41.0-9 keeps the harmless no-op DMA_BUF_IOCTL_SYNC patch installed for now. If the kernel-rebuild path works, the patch will be reverted in the next mpv-fourier rev.
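
Repro aid for the `dma-coherent` check in H7 (the `find ... -name dma-coherent` step): a minimal sketch, assuming the device tree is exposed as a directory tree with one file per property, as under `/sys/firmware/devicetree/base/`. The helper name `find_dma_coherent_nodes` is hypothetical and not part of any existing tooling. An empty result is what the notes above report for ohm, which is consistent with `pfdev->coherent = false`.

```python
import os


def find_dma_coherent_nodes(dt_root="/sys/firmware/devicetree/base"):
    """Return device-tree node paths (relative to dt_root) that carry
    a `dma-coherent` property file.

    Hypothetical helper: it only mirrors `find <root> -name dma-coherent`
    and does not inspect the panfrost driver state itself.
    """
    hits = []
    # os.walk silently yields nothing if dt_root does not exist.
    for dirpath, _dirnames, filenames in os.walk(dt_root):
        if "dma-coherent" in filenames:
            hits.append(os.path.relpath(dirpath, dt_root))
    return sorted(hits)


if __name__ == "__main__":
    nodes = find_dma_coherent_nodes()
    if nodes:
        for node in nodes:
            print(f"dma-coherent present on node: {node}")
    else:
        print("no dma-coherent property found")
```

Run it on the target before and after any DT change. A hit on the GPU node would change the H7 picture, since `pfdev->coherent` is derived from that property.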