rosenblatt/userspace/llm-runtime/README.md

# llm-runtime

llama.cpp fork (or out-of-tree backend) with the rknpu ggml backend.

Code lands here starting at Phase 5 (Plan); Phase 1 is too early for it.

Until then, this directory is reserved for:
- design notes (`docs/architecture.md` in the project root is authoritative)
- the eventual `ggml-rknpu/` backend source
- a patch series for upstream submission, if quality reaches that bar

## Approach

Two paths to consider in Phase 5:

1. **Fork llama.cpp, add backend in tree.** Easier to keep in sync;
   harder to upstream because llama.cpp may not want a Rockchip-specific
   backend that depends on a still-WIP mainline driver.
2. **Out-of-tree backend, loaded through llama.cpp's dynamic backend
   loading (built with `-DGGML_BACKEND_DL=ON`).** Cleaner separation; tracks
   llama.cpp upstream without our diff getting in the way. Recommended
   unless we need to patch core llama.cpp logic.

Decision deferred to Phase 5.
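
If path 2 is chosen, the build and run steps might look roughly like the sketch below. `GGML_BACKEND_DL` and `BUILD_SHARED_LIBS` are existing llama.cpp CMake options. The checkout location, the build directory, and the `libggml-rknpu.so` file name are placeholders, and relying on the `GGML_BACKEND_PATH` environment variable to point ggml at an extra backend library is an assumption to verify against the llama.cpp revision in use.

```shell
# Sketch only: all paths and the backend library name are hypothetical.
LLAMA_SRC="${LLAMA_SRC:-./llama.cpp}"        # assumed llama.cpp checkout
BUILD_DIR="${BUILD_DIR:-./build-llama}"      # assumed build directory
RKNPU_BACKEND="${RKNPU_BACKEND:-./ggml-rknpu/build/libggml-rknpu.so}"

# Configure and build llama.cpp with dynamically loadable backends.
# GGML_BACKEND_DL needs a shared-library build.
build_llama() {
    cmake -S "$LLAMA_SRC" -B "$BUILD_DIR" \
        -DBUILD_SHARED_LIBS=ON \
        -DGGML_BACKEND_DL=ON &&
    cmake --build "$BUILD_DIR" --config Release
}

# Run llama-cli with the out-of-tree backend library exposed to ggml.
# Assumption: ggml honours GGML_BACKEND_PATH when it loads backends.
run_with_rknpu() {
    GGML_BACKEND_PATH="$RKNPU_BACKEND" "$BUILD_DIR/bin/llama-cli" "$@"
}
```

A possible invocation would be `build_llama && run_with_rknpu -m model.gguf -p "hello"`. Nothing runs until the functions are called, so the snippet can be sourced safely before the backend exists.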

## Model

Phase-1 target: `qwen2.5-1.5b-instruct-q4_k_m.gguf`. Source:
hf.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF, or quantized locally with
`llama-quantize`.
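
Both ways of obtaining the model file can be sketched as shell functions. This assumes `huggingface-cli` (from the `huggingface_hub` Python package) and `llama-quantize` are on `PATH`. The `models` directory and the f16 input file are hypothetical, and the file name inside the Hugging Face repository is assumed to match the target name above.

```shell
# Sketch only: MODEL_DIR is a hypothetical location.
MODEL_REPO="Qwen/Qwen2.5-1.5B-Instruct-GGUF"
MODEL_FILE="qwen2.5-1.5b-instruct-q4_k_m.gguf"
MODEL_DIR="${MODEL_DIR:-./models}"

# Option A: fetch the pre-quantized file from Hugging Face.
fetch_model() {
    huggingface-cli download "$MODEL_REPO" "$MODEL_FILE" --local-dir "$MODEL_DIR"
}

# Option B: quantize a local f16 GGUF (first argument) to Q4_K_M.
quantize_model() {
    mkdir -p "$MODEL_DIR" &&
    llama-quantize "$1" "$MODEL_DIR/$MODEL_FILE" Q4_K_M
}
```

For example, `fetch_model` downloads the published file, while `quantize_model ./qwen2.5-1.5b-instruct-f16.gguf` would build it from a locally converted f16 model (that input file name is a placeholder).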

Stretch: qwen2.5-3B (if memory + NPU SRAM allow), gemma3-2B.