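There are two places to get models, matching GenieX’s two runtimes: Qualcomm AI Hub for Qualcomm AI Engine Direct bundles, and Hugging Face for GGUF models that run on llama.cpp.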

Run a Qualcomm AI Hub Model

For a model in the ai-hub-models/* namespace, just geniex infer it:
geniex infer ai-hub-models/Qwen3-4B
Qualcomm AI Engine Direct is NPU-only. Pass --compute npu or omit the flag.

Run a GGUF model from Hugging Face

Any compatible GGUF repo on Hugging Face works. Copy the repo path and pass it to geniex infer:
geniex infer <org>/<repo>-GGUF
For example:
geniex infer unsloth/Phi-4-mini-instruct-GGUF
When prompted:
  • Model type: vlm for vision-language models (Qwen3-VL family), llm otherwise.
  • Precision (Quantization): Q4_0 for best Hexagon NPU support. See Precisions (Quantizations) Supported.

Set up a Hugging Face token

Some models on Hugging Face are gated — you must accept the model’s license on the Hugging Face website and provide an access token before GenieX can download them.
1. Create a token

Go to huggingface.co/settings/tokens and create a new token with Read access.
2. Accept the model license

Visit the model’s page on Hugging Face (e.g. https://huggingface.co/<org>/<model>) and accept the license/agreement if prompted.
3. Provide the token to GenieX

GenieX checks the following sources in order (first non-empty value wins):
  1. GENIEX_HFTOKEN environment variable
  2. HF_TOKEN environment variable
  3. ~/.cache/huggingface/token file
Option A — environment variable (recommended):
# Standard HF variable (works with other HF tools too)
$env:HF_TOKEN = "hf_..."

# Or GenieX-specific (takes priority over HF_TOKEN)
$env:GENIEX_HFTOKEN = "hf_..."
Option B — Hugging Face CLI login (persists to disk):
pip install huggingface_hub
huggingface-cli login
This writes the token to ~/.cache/huggingface/token, which GenieX reads automatically.
GENIEX_HFTOKEN takes highest priority, then HF_TOKEN, then the cached token file. Use GENIEX_HFTOKEN if you need a separate token for GenieX without affecting other Hugging Face tools.
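The lookup order described above can be sketched in Python. This is only an illustration of the documented priority, not GenieX's actual implementation; the function name resolve_hf_token and its parameters are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hf_token(
    env: Mapping[str, str] = os.environ,
    token_file: Path = Path.home() / ".cache" / "huggingface" / "token",
) -> Optional[str]:
    """Return the first non-empty token, following the documented order.

    Hypothetical helper: it mirrors the priority list in this guide only.
    """
    # Sources 1 and 2: environment variables, GenieX-specific one first.
    for name in ("GENIEX_HFTOKEN", "HF_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    # Source 3: the token file written by the Hugging Face CLI login.
    if token_file.is_file():
        value = token_file.read_text(encoding="utf-8").strip()
        if value:
            return value
    return None
```

A check like this can help when a gated download fails: calling resolve_hf_token() and testing whether the result is None shows whether any of the three sources is populated, without printing the token itself.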

Run a local Qualcomm AI Engine Direct bundle

Bundles outside the ai-hub-models/* namespace — whether self-converted from Hugging Face or already sitting on disk — are imported once, then run like any other model.
Register the bundle with geniex pull --local-path, then geniex infer it.

Self-converted from Hugging Face

Make sure the Hugging Face CLI is installed, then download a bundle and pull it:
# Set model name (must match the HF subfolder name exactly)
$model = "llama_v3_2_3b_instruct"

# Download the model from Hugging Face
hf download yichqian/geniex-qairt-models `
  --local-dir $model `
  --include "$model/**"

# Pull using the downloaded folder (absolute path)
geniex pull local/$model --local-path (Resolve-Path "$model\$model").Path

# Run inference
geniex infer local/$model
Already on disk

Point --local-path at the extracted bundle directory (containing .bin shards and metadata.json):
geniex pull local/my-bundle --local-path C:\models\my-bundle
geniex infer local/my-bundle
You can also pull directly from an AI Hub .zip archive without extracting first:
geniex pull local/my-bundle --local-path C:\downloads\model.zip
geniex pull copies model files into its local cache. After a successful pull, you can safely delete the original download to reclaim disk space. Use geniex list to confirm the model is cached.
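Before pulling an extracted bundle, it can be useful to confirm that the directory has the expected shape. The sketch below checks for the two things this guide mentions, .bin shards and metadata.json. It assumes both sit at the top level of the directory, which may not hold for every bundle; looks_like_bundle is a hypothetical helper, not part of GenieX.

```python
from pathlib import Path


def looks_like_bundle(path: Path) -> bool:
    """Heuristic: does `path` resemble an extracted bundle directory?

    Assumes metadata.json and the .bin shards are at the top level.
    """
    if not path.is_dir():
        return False
    has_metadata = (path / "metadata.json").is_file()
    has_shards = any(path.glob("*.bin"))
    return has_metadata and has_shards
```

For example, looks_like_bundle(Path(r"C:\models\my-bundle")) returning False suggests that --local-path points one level too high or too low.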

Run a local GGUF model

A local GGUF model is a directory (or file) holding your .gguf weights — side-loaded, produced by another tool, or already on disk. Import it once, then run it like any other model.
Point --local-path at the directory containing your .gguf file:
geniex pull local/my-model --local-path C:\models\my-model
geniex infer local/my-model
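To see which .gguf weights a path actually holds before importing it, a short Python sketch such as the following may help. find_gguf_files is a hypothetical helper; it only looks at the top level of a directory and does not validate the files' contents.

```python
from pathlib import Path
from typing import List


def find_gguf_files(path: Path) -> List[Path]:
    """List .gguf files at `path`, which may be a single file or a directory."""
    if path.is_file():
        return [path] if path.suffix.lower() == ".gguf" else []
    if path.is_dir():
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".gguf"
        )
    # Path does not exist.
    return []
```

An empty result for the directory you plan to pass to --local-path is a sign that the weights live elsewhere.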

Precisions (Quantizations) Supported

Which precision you pick determines where the model runs.

llama.cpp

llama.cpp accepts the full range of GGML quantization formats. The CLI prompts you to pick one when you geniex pull a GGUF model:
Choose a precision version to download
> Q4_0       [1.2 GiB] (default)
  Q8_0       [2.0 GiB]
  F16        [3.8 GiB]
| Precision (Quantization) | Runs on | Notes |
|---|---|---|
| Q4_0 (default) | Hexagon NPU | Best NPU support in llama.cpp. Recommended for most models. |
| Q8_0 | GPU / CPU | Better quality at ~2× the disk and memory cost. |
| F16 | GPU / CPU | Reference precision. Mainly for evaluation; large and slow. |
| Q4_K_M, Q5_K_M, etc. | GPU / CPU | Mixed-precision K-quants. Not optimized for Hexagon NPU. |
Stick with Q4_0 if you want the model to land on the Hexagon NPU. Other precisions will work but typically run on GPU or CPU.
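The rule of thumb above can be written as a small lookup. This Python sketch only restates the guidance in this section (Q4_0 targets the Hexagon NPU, other precisions typically run on GPU or CPU); the function names are hypothetical and the result is an expectation, not a guarantee of where GenieX will schedule the model.

```python
from typing import Iterable, Optional

NPU_PRECISION = "Q4_0"  # the precision this guide recommends for the Hexagon NPU


def expected_target(precision: str) -> str:
    """Where a GGUF precision typically runs, per the guidance in this section."""
    if precision.strip().upper() == NPU_PRECISION:
        return "Hexagon NPU"
    return "GPU / CPU"


def pick_for_npu(available: Iterable[str]) -> Optional[str]:
    """Return the NPU-friendly precision if the repo offers it, else None."""
    for precision in available:
        if precision.strip().upper() == NPU_PRECISION:
            return precision
    return None
```

For instance, pick_for_npu(["Q8_0", "Q4_0", "F16"]) selects "Q4_0", while a repo that ships only K-quants yields None, signalling that the model will likely not land on the NPU.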

Qualcomm AI Engine Direct

Qualcomm AI Engine Direct bundles are statically pre-quantized — there is no precision choice at runtime. To use a different precision, get a different bundle from Qualcomm AI Hub.
| Precision (Quantization) | Runs on | Notes |
|---|---|---|
| w4a16 (most common) | Hexagon NPU | Weights int4, activations int16. Best balance of size and accuracy for on-device LLMs. |
| w4 | Hexagon NPU | Weights int4, activations float. Slightly higher accuracy than w4a16 at the cost of more compute. |
Most pre-built bundles on Qualcomm AI Hub use w4a16. The quantization level is baked into the bundle at compile time — pick the bundle that matches your quality/performance target.
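The precision labels encode weight and activation widths, as the notes above spell out for w4a16 and w4. The following Python sketch decodes such a label. It assumes the pattern generalizes as w<bits> optionally followed by a<bits>, which this guide only demonstrates for those two labels; parse_precision is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional


class Precision(NamedTuple):
    weight_bits: int
    activation_bits: Optional[int]  # None means float activations


# Assumed label shape: "w<bits>" with an optional "a<bits>" suffix.
_LABEL = re.compile(r"^w(\d+)(?:a(\d+))?$")


def parse_precision(label: str) -> Precision:
    """Decode labels such as 'w4a16' or 'w4' into bit widths."""
    match = _LABEL.match(label.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized precision label: {label!r}")
    weights, activations = match.groups()
    return Precision(int(weights), int(activations) if activations else None)
```

Under that assumption, parse_precision("w4a16") gives 4-bit weights with 16-bit activations, and parse_precision("w4") gives 4-bit weights with float activations.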
For other bundle constraints (context length, KV cache, Android nCtx), see Qualcomm AI Engine Direct runtime constraints.