# npu-probe

Smallest-possible userspace binary that:

1. Opens the NPU device (path TBD per Phase-1 audit)
2. Allocates two INT8 input tensors (64×64) + one output (64×64)
3. Submits a matmul via the uAPI in use (Tomeu's accel ioctl OR our own shim around vendor MMIO if accel-mainline isn't ready)
4. Waits for completion (DMA fence or polled completion register)
5. Reads back the output
6. Compares to a CPU INT8 matmul reference; reports pass/fail

**Phase-1 deliverable.** Until this works, nothing else in this repo can be exercised against real silicon.

## Build

_(filled when Phase-1 audit picks the uAPI shape; `meson` or `cmake`, no autotools)_

## Run

```
./npu-probe                                 # default 64×64 INT8 matmul
./npu-probe --shape 128,128,128             # M,N,K override
./npu-probe --device /dev/accel/accel0      # override device path
./npu-probe --golden golden_64x64.bin       # provide expected output for diff
```

## Why C, not Python

Direct ioctl + dmabuf + mmap. Python wrapper layer would obscure the exact syscall sequence we need to understand. Once npu-probe works, a Python binding for benchmark scripts is fine.
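
The only step that does not depend on the uAPI choice is step 6, the CPU reference and diff. A minimal sketch of it follows. The function names `ref_matmul_s8` and `count_mismatches` are hypothetical, and the sketch assumes row-major tensors and INT32 accumulators. This README does not say whether the NPU returns raw INT32 accumulators or requantized INT8, so the comparison may need adjusting once the audit settles that.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Reference matmul: C[M x N] = A[M x K] * B[K x N].
 * Assumptions (not confirmed by the README): row-major layout,
 * signed INT8 inputs, INT32 accumulation with no requantization.
 */
static void ref_matmul_s8(const int8_t *a, const int8_t *b, int32_t *c,
                          size_t m, size_t n, size_t k)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;
            for (size_t p = 0; p < k; p++)
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/*
 * Element-wise diff of the device output against the reference.
 * Returns the number of mismatching elements and logs the first one,
 * so the caller can report pass (0) or fail (> 0).
 */
static size_t count_mismatches(const int32_t *got, const int32_t *want,
                               size_t count)
{
    size_t bad = 0;

    for (size_t i = 0; i < count; i++) {
        if (got[i] != want[i]) {
            if (bad == 0)
                fprintf(stderr,
                        "first mismatch at index %zu: got %" PRId32
                        ", want %" PRId32 "\n",
                        i, got[i], want[i]);
            bad++;
        }
    }
    return bad;
}
```

With the default shape, the probe would call `ref_matmul_s8(a, b, want, 64, 64, 64)` and then `count_mismatches(npu_out, want, 64 * 64)`, where `npu_out` stands for whatever buffer step 5 reads back. INT32 accumulation cannot overflow at these sizes: each product is at most 16384 in magnitude, so even K = 128 stays far below the INT32 limit.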