Prerequisites
- The CLI installed — see Install.
- Interactive shell from container (Docker only) — see Run interactively.
- Familiarity with runtime choice —
qairt(Qualcomm AI Engine Direct) for Qualcomm AI Hub Models,llama_cppfor any GGUF.
Run your first model
Qualcomm AI Engine Direct runtime (Qualcomm AI Hub)
Language model:windows
windows
llama.cpp runtime (GGUF)
PickQ4_0 when prompted — it has the best Hexagon NPU support.
Language model:
windows
windows
- Model type —
vlmfor vision-language models,llmfor text-only models. ForQwen3.5andGemma4, pickllmfor now (multimodal support coming soon). - Precision (Quantization) —
Q4_0for best Hexagon NPU performance.
Run a local model
Already have a model on disk, or want to self-convert a bundle from Hugging Face? Usegeniex pull with --local-path to register it, then run it like any other model. See:
- Run a local Qualcomm AI Engine Direct bundle — self-converted from Hugging Face, an extracted bundle directory, or an AI Hub
.zip. - Run a local GGUF model — a directory containing your
.gguffile.
Next steps
Local server
Expose an OpenAI-compatible HTTP API on
localhost:18181.CLI reference
Every command, every flag.
Was this page helpful?