x11-session-research/phase0_evidence/wayland_baseline_2026-05-03/a1_protocol.md

# A1 baseline protocol — in-session Plasma Wayland anchor

**Goal:** acquire 3 in-session reps of chromium-fourier video
playback under Plasma Wayland (KWin), so the X11 cells of the
matrix have a same-session Wayland reference to compare
against. Per the campaign-contained-data discipline, **this is
the only Wayland baseline this campaign uses**; predecessor
numbers are reference history only.

## Cell

- **Browser:** `/tmp/chromium-ohm-gl-fix-step2/chrome`
  (chromium-fourier 149.0.7812.0, the existing predecessor
  build).
- **Page:** `file:///home/mfritsche/fourier-test/brave_drops_test.html`
  (a 30 fps H.264 video element with autoplay; the page emits
  a drops trajectory to the console at 1 Hz; used by the
  predecessor for all Phase 3 reps).
- **Session:** Plasma Wayland tty1 / session 433
  (the live one, autologin'd via revert.log entry 6).
- **Window:** windowed (default chromium behavior, no fullscreen).
- **Decode:** chromium-fourier's default decode path. With the
  Step 1 + Step 2 patches present, this is libva via
  `libva-v4l2-request-fourier` driver (V4L2 stateless on hantro).
- **Capture window:** 70 s starting at autoplay-detected.
- **Instrumentation:** `top -p <kwin_wayland PID>` (1 Hz),
  `top` (system, 1 Hz), `sudo perf record -F 99 -g
  --call-graph dwarf -p <kwin_wayland PID>`, browser stderr
  (catches the page's `DROPS_TRAJECTORY: t=Xs tot=Y drop=Z`
  1 Hz log). Both `-p` options take a PID, not a process name.
  **No `WAYLAND_DEBUG=1`**: this is the `nodebug` variant, so
  the kwin %CPU and drop measurements aren't perturbed by
  WAYLAND_DEBUG's per-message overhead.
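
The `DROPS_TRAJECTORY` line format named in the instrumentation
bullet can be pulled out of the browser stderr with a small
parser. A minimal sketch in Python, assuming `t` is a seconds
value and `tot` / `drop` are integer counters;
`parse_trajectory` is a hypothetical helper name, not part of
the orchestrator script.

```python
import re

# Line shape taken from the page's log: DROPS_TRAJECTORY: t=Xs tot=Y drop=Z
# Assumption: t is numeric seconds, tot and drop are non-negative integers.
TRAJECTORY_RE = re.compile(
    r"DROPS_TRAJECTORY: t=(\d+(?:\.\d+)?)s\s+tot=(\d+)\s+drop=(\d+)"
)


def parse_trajectory(stderr_text):
    """Return (t_seconds, tot, drop) tuples in log order.

    Lines that do not carry a DROPS_TRAJECTORY record are skipped, so the
    full chromium stderr can be passed in unchanged.
    """
    samples = []
    for line in stderr_text.splitlines():
        match = TRAJECTORY_RE.search(line)
        if match:
            samples.append(
                (float(match.group(1)), int(match.group(2)), int(match.group(3)))
            )
    return samples
```

`search` rather than `match` is used because chromium typically
wraps console output in its own log prefix, so the record may
not start at column 0.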

## Bound metrics per rep

Each rep's evidence dir contains:

- `start.txt` / `end.txt` / `capture_start.txt` — wall-clock
  timestamps of phases.
- `temp_pre.txt` / `temp_post.txt` — thermal_zone0 (cpu) at
  phase boundaries.
- `top_kwin.txt` — `kwin_wayland` %CPU samples (70 × 1 Hz).
- `top_full.txt` — system-wide top (70 × 1 Hz).
- `perf.data` — perf record at 99 Hz on kwin_wayland.
- `perf_report_self.txt` — perf report (sorted by overhead).
- `perf_report_top50.txt` — first 50 lines of perf report.
- `stderr.log` — full chromium stderr.
- `drops_trajectory.txt` — extracted DROPS_TRAJECTORY lines.
- `kwin_cpu_summary.txt` — kwin %CPU samples / median / mean /
  min / max.
- `drops_summary.txt` — `frames_total`, `drops_total`,
  `drops_post_warmup` (drops accumulated after t=10 s).
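
The two summary files reduce to a few lines of arithmetic. The
sketch below is illustrative, not the orchestrator's actual
code: it assumes `tot` and `drop` are cumulative counters (the
protocol does not say so explicitly), takes the kwin %CPU
samples as an already-parsed list of floats, and uses
hypothetical function names.

```python
from statistics import mean, median


def summarize_kwin_cpu(cpu_samples):
    """Samples / median / mean / min / max over the 1 Hz kwin %CPU samples."""
    if not cpu_samples:
        return None
    return {
        "samples": len(cpu_samples),
        "median": median(cpu_samples),
        "mean": mean(cpu_samples),
        "min": min(cpu_samples),
        "max": max(cpu_samples),
    }


def summarize_drops(trajectory, warmup_s=10.0):
    """Summarize (t_seconds, tot, drop) tuples given in log order.

    Assumption: tot and drop are cumulative, so the last record holds the
    totals and drops_post_warmup is the difference from the t=10 s record.
    """
    if not trajectory:
        return None
    _, frames_total, drops_total = trajectory[-1]
    # Last record at or before the warm-up boundary; 0 if none was logged.
    warm = [drop for t, _, drop in trajectory if t <= warmup_s]
    drops_at_warmup = warm[-1] if warm else 0
    return {
        "frames_total": frames_total,
        "drops_total": drops_total,
        "drops_post_warmup": drops_total - drops_at_warmup,
    }
```

If the page's counters turn out to be per-interval rather than
cumulative, the totals would need to be sums instead of the
last record.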

## Protocol

Three reps **back-to-back** with ≥ 30 s idle between reps to
let thermals settle. The whole three-rep sequence takes just
under 6 minutes of wall time (3 × ~95 s + 2 × 30 s = 5:45):

```
T+0:00   rep 1: launch + 70s capture + cleanup       (~95s)
T+1:35   30s idle (thermal settle)
T+2:05   rep 2: same                                  (~95s)
T+3:40   30s idle
T+4:10   rep 3: same                                  (~95s)
T+5:45   done — pull evidence
```
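
The offsets in the timeline follow directly from the per-rep
and idle durations. A small sketch that reproduces them; the
~95 s per rep is the protocol's own estimate, and `schedule` /
`fmt` are hypothetical helper names.

```python
def schedule(reps=3, rep_s=95, idle_s=30):
    """Return (offset_seconds, label) events for reps separated by idle gaps."""
    events = []
    t = 0
    for i in range(1, reps + 1):
        events.append((t, f"rep {i}"))
        t += rep_s
        if i < reps:  # no idle gap after the final rep
            events.append((t, "idle"))
            t += idle_s
    events.append((t, "done"))
    return events


def fmt(seconds):
    """Format an offset as T+M:SS."""
    return f"T+{seconds // 60}:{seconds % 60:02d}"


if __name__ == "__main__":
    for offset, label in schedule():
        print(fmt(offset), label)
```

With the defaults, the last event lands at 345 s, i.e. T+5:45.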

**SSH-driven:** the orchestrator
`/home/mfritsche/phase3_prime_runs/run_browser_nodebug.sh
$RUN_ID chromium-fourier-kwin` runs end-to-end from a single
SSH command. Operator-side, **a chrome window will appear on
the screen for ~80 s per rep**; the only operator action is
**not interacting with that window** (no clicks, no typing in
the chrome window, no pulling focus). The orchestrator kills
the chrome process cleanly at end of capture.

After the 3 reps complete, this campaign's evidence
sub-directory `phase0_evidence/wayland_baseline_2026-05-03/`
will contain:

```
a1_rep1/   (moved from /home/mfritsche/phase3_prime_runs/x11research_a1_rep1/)
a1_rep2/
a1_rep3/
a1_summary.md   (this campaign's interpretation of the 3 reps)
```

The original predecessor evidence at
`/home/mfritsche/phase3_prime_runs/kwin_timing_nodebug_rep[1-3]`
is **untouched**.

## Exit conditions

- Per-rep success = `drops_summary.txt` exists with non-`n/a`
  values, `kwin_cpu_summary.txt` exists with samples > 0, perf
  report has > 1000 samples.
- Per-rep failure causes:
  - autoplay not detected within 30 s → script aborts, evidence
    dir is partial; rep marked failed.
  - workload exits before autoplay → script aborts.
  - perf record fails (e.g. `perf_event_paranoid` > 1) →
    script continues but `perf.data` is empty; this shows up
    in `perf_record_stderr.txt`.

If a rep fails, surface the cause and re-run that rep before
moving on.
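
The per-rep success criteria can be expressed as a single
predicate. This is a sketch under stated assumptions: the
on-disk layout of `drops_summary.txt` and
`kwin_cpu_summary.txt` is not specified here, so the function
takes already-parsed values, and `rep_succeeded` is a
hypothetical name.

```python
REQUIRED_DROP_KEYS = ("frames_total", "drops_total", "drops_post_warmup")


def rep_succeeded(drops_summary, kwin_sample_count, perf_sample_count,
                  min_perf_samples=1000):
    """Apply the three per-rep success checks.

    drops_summary: dict parsed from drops_summary.txt, or None if the file
                   is missing (parsing itself is left to the caller).
    kwin_sample_count: number of kwin %CPU samples in kwin_cpu_summary.txt.
    perf_sample_count: number of samples in the perf report.
    """
    if drops_summary is None:
        return False
    # Every bound drop metric must be present and not the "n/a" placeholder.
    for key in REQUIRED_DROP_KEYS:
        if drops_summary.get(key) in (None, "n/a"):
            return False
    return kwin_sample_count > 0 and perf_sample_count > min_perf_samples
```

A `drops_post_warmup` of 0 still counts as a valid value; only
a missing key or the literal `n/a` fails the check.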

## Decision after 3 reps

Compute median + IQR of `drops_post_warmup`, `frames_total`,
`drops_total`, and kwin %CPU across the three reps. Two
possible verdict shapes:

- **Tight cluster (IQR / median ≤ 0.3):** baseline is stable;
  Phase 1 binding cells can use the median as the anchor with
  the IQR as the tolerance band.
- **High variance (IQR / median > 0.3):** baseline is noisy;
  Phase 1 needs ≥ 5 reps per cell, not 3, and binding-cell
  thresholds need IQR-based formulation rather than fixed
  numbers. This is the predecessor lesson built into the
  worklist's "3 reps minimum (variance is a real concern)".
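
The verdict rule can be sketched as follows. Two points are
assumptions, not part of the protocol: the quartile method
(with only three reps the choice matters, and this sketch uses
the `inclusive` method of Python's `statistics.quantiles`), and
the handling of a zero median, where the ratio is undefined and
the sketch reports that instead of picking a verdict. Function
names are hypothetical.

```python
from statistics import median, quantiles


def spread_ratio(values):
    """Return IQR / median, or None when the median is 0."""
    # Assumption: 'inclusive' quartiles; with 3 reps the 'exclusive' method
    # would make the IQR equal to the full range instead.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    mid = median(values)
    if mid == 0:
        return None
    return (q3 - q1) / mid


def verdict(values, threshold=0.3):
    """Classify one metric across reps using the 0.3 threshold."""
    ratio = spread_ratio(values)
    if ratio is None:
        return "undefined"
    return "tight" if ratio <= threshold else "high-variance"


if __name__ == "__main__":
    # Illustrative numbers only, not measured data.
    print(verdict([10, 12, 11]))
    print(verdict([5, 20, 10]))
```

The zero-median case is most likely to arise for
`drops_post_warmup`, where a run with no post-warm-up drops in
most reps would otherwise divide by zero.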

## Operator green-light request

Before I fire the 3 reps:

1. Confirm you're OK with **a chrome window popping up on the
   screen for ~80 s per rep × 3 reps**, and with **not
   interacting with it** during that time (mouse stays still,
   no key presses).
2. Confirm the current Plasma Wayland session is in a "clean
   measurement state" — i.e. nothing else is doing significant
   CPU work (ideally close any active terminals/browsers/IDE
   windows you don't need; the predecessor's 36-37 % kwin %CPU
   baseline assumes a quiescent desktop with just the test
   chrome window plus normal Plasma services running).
3. (Optional) Decide whether to also include other variants in
   this turn's measurements — e.g. add a rep of `Brave 147` or
   `Firefox 150` under Plasma Wayland to start populating the
   full matrix. Default scope: just the 3 chromium-fourier
   reps; matrix-fill cells go into Phase 1 proper.

When green-lit, I fire `run_browser_nodebug.sh
x11research_a1_rep1 chromium-fourier-kwin` first as a smoke
test, surface its `drops_summary.txt` + `kwin_cpu_summary.txt`
output, then on confirmation fire reps 2 and 3 back-to-back.