Iteration 5 Phase 5: sonnet review YELLOW caveats addressed
Sonnet caught four iter5 Phase 5 review caveats: - C1: sweep missed 4 more DEBUG sites (3 in surface.c, 1 in h264.c) - C2: static bool readback_warned was new process-global state - C3: msync removal not pixel-verified (probably safe; named caveat) - C4: Track B "implicit fix" overstated — reframe as "doesn't reproduce" C1 + C2 resolved in fork commit c8b6ede (107 lines removed). C3 and C4 documented in phase4_iter5_plan.md as named caveats. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+19
-15
@@ -30,18 +30,17 @@ Audit confirmed only LAST_OUTPUT_* was mutable process-global state. Other stati
|
||||
|
||||
**Verification:** two concurrent mpv processes with 2-second stagger both decoded 300 frames cleanly, no cross-context corruption. Sub-second co-launch hits kernel-level fd contention on /dev/video1 (hantro is a single-instance device); cross-process serialization is out of scope for a libva backend.
|
||||
|
||||
## Track B — mpv libplacebo `--vo=gpu` segfault ✓ COMPLETE (implicit fix)
|
||||
## Track B — mpv libplacebo `--vo=gpu`: doesn't reproduce on this consumer ✓
|
||||
|
||||
iter3 substrate documented the segfault: Vulkan init fails → mpv falls through to GPU non-vulkan path → 4 frames decode → REQBUFS EBUSY → bizarre CreateSurfaces2 with `sizes[1]=1050626` (uninitialized memory) → SIGSEGV.
|
||||
|
||||
**Empirical re-test on iter5-end driver (post-A + post-E):** `mpv --hwdec=vaapi --vo=gpu` ran for 32 seconds of stream content (all of `--frames=200` + sustained beyond), 98 dropped frames out of ~768, **zero segfaults / SIGSEGV / VK_ERROR_DEVICE_LOST / abort()**. The Vulkan-init-failed warnings still appear ("EnumeratePhysicalDevices ... VK_ERROR_INITIALIZATION_FAILED") but mpv successfully fall-through-decodes.
|
||||
**Empirical re-test on iter5-end driver (post-A + post-E):** `mpv --hwdec=vaapi --vo=gpu` ran for 32 seconds of stream content (all of `--frames=200` + sustained beyond), 98 dropped frames out of ~768, **zero segfaults / SIGSEGV / VK_ERROR_DEVICE_LOST / abort()**. The Vulkan-init-failed warnings still appear ("EnumeratePhysicalDevices ... VK_ERROR_INITIALIZATION_FAILED") and that's steady-state on Mali-G52 / Bifrost (no PanVk for that GPU yet — see `reference_pinetab_no_vulkan.md` memory). mpv falls through to GLES via Panfrost.
|
||||
|
||||
The iter3-era crash was implicitly fixed somewhere between iter3 and iter5, most likely by:
|
||||
- iter4's fresh-request_fd-per-frame fix (`385dee1`): timing change closes the cap_pool race window where REQBUFS EBUSY surfaces.
|
||||
- iter4's DPB fields/used-only fixes: kernel state stays consistent, no garbage CreateSurfaces2.
|
||||
- iter5's per-driver-data move: race elimination on resolution-change.
|
||||
Phase 5 sonnet review (C4): "implicit fix" overstated — refined to **"doesn't reproduce on this consumer pattern."** The iter4 + iter5 code changes don't directly close the cap_pool REQBUFS-EBUSY-on-resolution-change path that the iter3 substrate documented. The 32s GLES test path doesn't exercise the probe-with-garbage-dimensions consumer pattern that originally triggered the SIGSEGV. So the failure shape is *latent*, not *closed*: a future libva consumer that probes with `vaCreateSurfaces(16, 16)` between two `vaCreateSurfaces(1920, 1088)` calls while CAPTURE STREAMON is active could still hit the same path.
|
||||
|
||||
No iter5 code change required for Track B beyond what A + E already landed. The iter3-era documentation in `phase0_findings_iter3.md` was correct that the bug was real, but the bug is gone now.
|
||||
The cap_pool drain ordering concern survives as iter6+ candidate. iter5's Track B success criterion ("≥30s of bbb_1080p30 without segfault — OR root cause documented as upstream issue with workaround") is satisfied by the 32s clean run; the named caveat (cap_pool race window still latent under untested consumer patterns) is documented here.
|
||||
|
||||
No iter5 code change required for Track B beyond what A + E landed. Phase 5 review C4 framing applied.
|
||||
|
||||
## Track G — PGO-disabled Firefox rebuild (in progress)
|
||||
|
||||
@@ -51,16 +50,21 @@ Single-pass build kicked at iter5 Phase 4G start; running on boltzmann firefox-f
|
||||
|
||||
Will deploy to `/opt/firefox-fourier/` replacing the iter3 PGO-instrumented binary. Expected libxul.so size delta: 3.6 GB (PGO instrumented) → ~150-300 MB (release). Phase 7G verifies on-ohm playback.
|
||||
|
||||
## Phase 4 → Phase 5 transition
|
||||
## Phase 5 sonnet review caveats addressed (in commit `c8b6ede`)
|
||||
|
||||
Phase 4 deliverables landed for A + E + B. G in progress. Phase 5 sonnet review will cover:
|
||||
- Track A correctness: did any sweep removal break load-bearing code?
|
||||
- Track E semantics: is per-driver-data the right binding unit for last_output_*?
|
||||
- Track B verification: is "32s clean" sufficient or do we need longer/different content?
|
||||
- Track G: post-rebuild deployment + Firefox-side verification once package() finishes.
|
||||
Phase 5 review came back YELLOW with four caveats. Three resolved in code, one in documentation:
|
||||
|
||||
- **C1 (Track A incomplete):** sweep missed three surface.c DEBUG sites (CreateSurfaces2 format-dump, ExportSurfaceHandle descriptor-dump, QuerySurfaceStatus status-dump) and a "3F observability" V4L2 readback block in h264.c. Resolved in `c8b6ede` — additional 107-line removal.
|
||||
- **C2 (`static bool readback_warned`):** new mutable process-global state introduced inside the readback block. Resolved by removing the readback block entirely (point above).
|
||||
- **C3 (msync removal pixel-correctness):** msync(MS_SYNC|MS_INVALIDATE) was paired with the iter1 hex-dump and removed alongside it. The CAPTURE buffer is read post-DQBUF via `copy_surface_to_image` (image.c) for vaapi-copy; on a CMA-backed non-coherent setup this could in principle need cache invalidation. Empirical: 2000-frame stress with 0 errors, no visible decode failure. Likely the kernel does DMA sync at DQBUF level. Documented as named caveat — frame-hash spot check could anchor it formally if needed; for now accept based on empirical pass.
|
||||
- **C4 (Track B "implicit fix" overstated):** reframed above as "doesn't reproduce on this consumer pattern." Cap_pool resolution-change race window remains latent under untested consumer probe patterns.
|
||||
|
||||
## Phase 4 → Phase 6/7/8 transition
|
||||
|
||||
Phase 4 + Phase 5 done for A + E + B. G in progress.
|
||||
|
||||
Phase 7 verification anchored:
|
||||
- A: 2000-frame mpv vaapi-copy stress, 0 EINVAL, log size 4.4 KB
|
||||
- A: 2000-frame mpv vaapi-copy stress, 0 EINVAL, log size ~4 KB. Plus post-Phase-5-cleanup will need a re-smoke once ohm is back online (additional 107 lines removed; verify still green).
|
||||
- E: 2-process concurrent mpv (300 frames each, 2s stagger), both clean
|
||||
- B: mpv --vo=gpu 32s, 0 segfaults
|
||||
- B: mpv --vo=gpu 32s, 0 segfaults (on GLES fallback path)
|
||||
- G: pending package + deploy
|
||||
|
||||
Reference in New Issue
Block a user