- Qualcomm AI Hub — pre-compiled bundles for Qualcomm AI Engine Direct, plus curated GGUF models for llama.cpp.
- Hugging Face — any GGUF model for llama.cpp. See Run a GGUF model from Hugging Face.
Run a Qualcomm AI Hub Model
For a model in the ai-hub-models/* namespace, just geniex infer it.
Qualcomm AI Engine Direct is NPU-only. Pass --compute npu or omit the flag.
Run a GGUF model from Hugging Face
Any compatible GGUF repo on Hugging Face works. Copy the repo path and pass it to geniex infer.
- Model type — vlm for vision-language models (Qwen3-VL family), llm otherwise.
- Precision (Quantization) — Q4_0 for best Hexagon NPU support. See Precisions (Quantizations) Supported.
Set up a Hugging Face token
Some models on Hugging Face are gated: you must accept the model’s license on the Hugging Face website and provide an access token before GenieX can download them.
Create a token
Go to huggingface.co/settings/tokens and create a new token with Read access.
Accept the model license
Visit the model’s page on Hugging Face (e.g. https://huggingface.co/<org>/<model>) and accept the license/agreement if prompted.
Provide the token to GenieX
GenieX checks the following sources in order (first non-empty value wins):
- GENIEX_HFTOKEN environment variable
- HF_TOKEN environment variable
- ~/.cache/huggingface/token file
Option B: Hugging Face CLI login (persists to disk). This writes the token to ~/.cache/huggingface/token, which GenieX reads automatically.
Run a local Qualcomm AI Engine Direct bundle
Bundles outside the ai-hub-models/* namespace, whether self-converted from Hugging Face or already sitting on disk, are imported once, then run like any other model.
Register the bundle with geniex pull --local-path, then geniex infer it.
Self-converted from Hugging Face
Make sure the Hugging Face CLI is installed, then download a bundle and pull it.
Already on disk
Point --local-path at the extracted bundle directory (containing .bin shards and metadata.json). You can also pull directly from an AI Hub .zip archive without extracting first.
geniex pull copies model files into its local cache. After a successful pull, you can safely delete the original download to reclaim disk space. Use geniex list to confirm the model is cached.
Run a local GGUF model
A local GGUF model is a directory (or file) holding your .gguf weights, whether side-loaded, produced by another tool, or already on disk. Import it once, then run it like any other model.
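Before importing, it can help to confirm that the path really holds GGUF weights. The sketch below is not part of GenieX; `find_gguf_files` is a hypothetical helper that relies only on the fact that GGUF files begin with the four ASCII bytes `GGUF`. It accepts either a single file or a directory and returns the files that pass the check.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def find_gguf_files(path):
    """Return the .gguf files at `path` whose header starts with the GGUF magic.

    `path` may be a single file or a directory; directories are scanned
    one level deep. This is an illustrative pre-check, not a GenieX API.
    """
    root = Path(path)
    candidates = [root] if root.is_file() else sorted(root.glob("*.gguf"))
    valid = []
    for candidate in candidates:
        if candidate.suffix != ".gguf":
            continue  # skip files that do not carry the .gguf extension
        with candidate.open("rb") as fh:
            if fh.read(4) == GGUF_MAGIC:
                valid.append(candidate)
    return valid
```

An empty result suggests the directory is not the one to pass as --local-path, or that the file is truncated or in a different format.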
Point --local-path at the directory containing your .gguf file.
Precisions (Quantizations) Supported
Which precision you pick determines where the model runs.
llama.cpp
llama.cpp accepts the full range of GGML quantization formats. The CLI prompts you to pick one when you geniex pull a GGUF model:
| Precision (Quantization) | Runs on | Notes |
|---|---|---|
| Q4_0 (default) | Hexagon NPU | Best NPU support in llama.cpp. Recommended for most models. |
| Q8_0 | GPU / CPU | Better quality at ~2× the disk and memory cost. |
| F16 | GPU / CPU | Reference precision. Mainly for evaluation; large and slow. |
| Q4_K_M, Q5_K_M, etc. | GPU / CPU | Mixed-precision K-quants. Not optimized for Hexagon NPU. |
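The "~2×" figure for Q8_0 can be checked with a rough calculation. In GGML, Q4_0 and Q8_0 each store weights in blocks of 32 with one 16-bit scale per block, which gives 4.5 and 8.5 bits per weight respectively. The sketch below turns that into a weight-size estimate; `estimated_weight_gib` is a hypothetical helper, the 7-billion parameter count is an arbitrary example, and the estimate ignores file metadata, tensors kept at higher precision, and runtime memory such as the KV cache.

```python
BLOCK_SIZE = 32  # weights per block in the Q4_0 and Q8_0 layouts

# Bits per weight = (quantized bits for the block + one 16-bit scale) / block size.
BITS_PER_WEIGHT = {
    "Q4_0": (BLOCK_SIZE * 4 + 16) / BLOCK_SIZE,  # 4.5
    "Q8_0": (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE,  # 8.5
    "F16": 16.0,
}


def estimated_weight_gib(n_params, precision):
    """Approximate size of the weights alone, in GiB."""
    total_bits = n_params * BITS_PER_WEIGHT[precision]
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    n_params = 7_000_000_000  # arbitrary example size, not a specific model
    for precision in BITS_PER_WEIGHT:
        print(f"{precision}: ~{estimated_weight_gib(n_params, precision):.1f} GiB")
    ratio = BITS_PER_WEIGHT["Q8_0"] / BITS_PER_WEIGHT["Q4_0"]
    print(f"Q8_0 / Q4_0 size ratio: {ratio:.2f}")
```

The ratio comes out slightly under 2 because both formats carry the same per-block scale overhead.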
Qualcomm AI Engine Direct
Qualcomm AI Engine Direct bundles are statically pre-quantized, so there is no precision choice at runtime. To use a different precision, get a different bundle from Qualcomm AI Hub.
| Precision (Quantization) | Runs on | Notes |
|---|---|---|
| w4a16 (most common) | Hexagon NPU | Weights int4, activations int16. Best balance of size and accuracy for on-device LLMs. |
| w4 | Hexagon NPU | Weights int4, activations float. Slightly higher accuracy than w4a16 at the cost of more compute. |
nCtx), see Qualcomm AI Engine Direct runtime constraints.