Models - Qualcomm® AI Hub GenieX

Models come from two sources, which between them cover GenieX’s two runtimes (Qualcomm AI Engine Direct and llama.cpp):

Qualcomm AI Hub — pre-compiled bundles for Qualcomm AI Engine Direct, plus curated GGUF models for llama.cpp.
Hugging Face — any GGUF model for llama.cpp. See Run a GGUF model from Hugging Face.

Run a Qualcomm AI Hub Model

For a model in the ai-hub-models/* namespace, just geniex infer it:

windows

geniex infer ai-hub-models/Qwen3-4B

Qualcomm AI Engine Direct is NPU-only. Pass --compute npu or omit the flag.

Run a GGUF model from Hugging Face

Any compatible GGUF repo on Hugging Face works. Copy the repo path and pass it to geniex infer:

windows

geniex infer <org>/<repo>-GGUF

For example:

windows

geniex infer unsloth/Phi-4-mini-instruct-GGUF

When prompted:

Model type — vlm for vision-language models (Qwen3-VL family), llm otherwise.
Precision (Quantization) — Q4_0 for best Hexagon NPU support. See Precisions (Quantizations) Supported.

Set up a Hugging Face token

Some models on Hugging Face are gated — you must accept the model’s license on the Hugging Face website and provide an access token before GenieX can download them.

Create a token

Go to huggingface.co/settings/tokens and create a new token with Read access.

Accept the model license

Visit the model’s page on Hugging Face (e.g. https://huggingface.co/<org>/<model>) and accept the license/agreement if prompted.

Provide the token to GenieX

GenieX checks the following sources in order (first non-empty value wins):

GENIEX_HFTOKEN environment variable
HF_TOKEN environment variable
~/.cache/huggingface/token file

Option A — environment variable (recommended):

# Standard HF variable (works with other HF tools too)
$env:HF_TOKEN = "hf_..."

# Or GenieX-specific (takes priority over HF_TOKEN)
$env:GENIEX_HFTOKEN = "hf_..."

Option B — Hugging Face CLI login (persists to disk):

pip install huggingface_hub
huggingface-cli login

This writes the token to ~/.cache/huggingface/token, which GenieX reads automatically.

GENIEX_HFTOKEN takes highest priority, then HF_TOKEN, then the cached token file. Use GENIEX_HFTOKEN if you need a separate token for GenieX without affecting other Hugging Face tools.
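
The lookup order above can be sketched in a few lines of Python. This only illustrates the documented precedence and is not GenieX's own implementation. The function name resolve_hf_token is hypothetical, and stripping surrounding whitespace before testing for "non-empty" is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hf_token(
    env: Mapping[str, str] = os.environ,
    token_file: Optional[Path] = None,
) -> Optional[str]:
    """Return the first non-empty token in the documented lookup order."""
    # Steps 1 and 2: environment variables, GenieX-specific one first.
    # Assumption: whitespace-only values count as empty.
    for name in ("GENIEX_HFTOKEN", "HF_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value

    # Step 3: the token file written by the Hugging Face CLI login.
    path = token_file or Path.home() / ".cache" / "huggingface" / "token"
    if path.is_file():
        value = path.read_text(encoding="utf-8").strip()
        if value:
            return value
    return None
```

Calling resolve_hf_token() with no arguments reads the real environment and the default token path, which is a quick way to check which source would win on a given machine. Avoid printing the returned value in shared logs.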

Run a local Qualcomm AI Engine Direct bundle

Bundles outside the ai-hub-models/* namespace — whether self-converted from Hugging Face or already sitting on disk — are imported once, then run like any other model.

CLI

Register the bundle with geniex pull --local-path, then geniex infer it.

Self-converted from Hugging Face

Make sure the Hugging Face CLI is installed, then download a bundle and pull it:

# Set model name (must match the HF subfolder name exactly)
$model = "llama_v3_2_3b_instruct"

# Download the model from Hugging Face
hf download yichqian/geniex-qairt-models `
  --local-dir $model `
  --include "$model/**"

# Pull using the downloaded folder (absolute path)
geniex pull local/$model --local-path (Resolve-Path "$model\$model").Path

# Run inference
geniex infer local/$model

Already on disk

Point --local-path at the extracted bundle directory (containing .bin shards and metadata.json):

geniex pull local/my-bundle --local-path C:\models\my-bundle
geniex infer local/my-bundle

You can also pull directly from an AI Hub .zip archive without extracting first:

geniex pull local/my-bundle --local-path C:\downloads\model.zip

geniex pull copies model files into its local cache. After a successful pull, you can safely delete the original download to reclaim disk space. Use geniex list to confirm the model is cached.
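
If the import is part of a setup script, the same two commands can be driven from Python. A minimal sketch, assuming geniex is on PATH; the helper names pull_command, infer_command, and import_and_run are hypothetical, and only the command shapes shown above are used.

```python
import subprocess
from pathlib import Path
from typing import List


def pull_command(model_name: str, local_path: Path) -> List[str]:
    """Build the `geniex pull <name> --local-path <path>` argument list."""
    return ["geniex", "pull", model_name, "--local-path", str(local_path)]


def infer_command(model_name: str) -> List[str]:
    """Build the `geniex infer <name>` argument list."""
    return ["geniex", "infer", model_name]


def import_and_run(model_name: str, local_path: Path) -> None:
    """Pull a local bundle into the cache, then start inference on it."""
    # check=True raises CalledProcessError if geniex exits non-zero,
    # so inference is never attempted after a failed pull.
    subprocess.run(pull_command(model_name, local_path), check=True)
    subprocess.run(infer_command(model_name), check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.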

Python

Point from_pretrained at the bundle directory and force the Qualcomm AI Engine Direct runtime with device_map="qairt":

from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    r"C:\<your_path>\Qwen3-4B-Instruct-2507",   # any .bin file or the bundle folder
    model_name="qwen3_4b_instruct_2507",        # matches the Qualcomm AI Hub model id
    device_map="qairt",                         # required for Qualcomm AI Engine Direct
)

See the Python API reference for all parameters.

Android

Set hub = HubSource.LOCALFS and point local_path at an extracted AI Hub directory (metadata.json plus one or more .bin shards) or the AI Hub .zip. pullFlow imports it into the SDK cache (no network); no chipset is required — the bundle is already compiled for one. The modality (LLM vs VLM) is read from metadata.json.

ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "local/qwen3-4b-2507",
        hub        = HubSource.LOCALFS,
        local_path = "/data/local/tmp/Qwen3-4B-Instruct-2507",  // extracted dir or .zip
    )
).collect { /* Progress / Completed / Error */ }

Then load with runtime_id = "qairt" — identical to the downloaded-model flow:

val paths = ModelManagerWrapper.getPaths("local/qwen3-4b-2507")
    ?: error("Model not imported")

val llm = LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name   = paths.model_name,
            model_path   = paths.model_path,
            config       = ModelConfig(max_tokens = 2048, enable_thinking = false),
            runtime_id   = "qairt",
            compute_unit = null,   // qairt is NPU-only
        )
    )
    .build()
    .getOrThrow()

The model_name is just the cache key: any org/repo-style string works, and a local/... prefix is a common convention.
You don’t author a geniex.json for a local import; the model manager generates one during import.
A QAIRT folder is recognized by metadata.json plus .bin shards (or a .zip); a directory of loose .bin files with no metadata.json won’t be detected.
When importing models of mixed types, prefer paths.runtime_id (authoritative) over hard-coding runtime_id.
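
Because a folder of loose .bin files is silently not detected, a quick check before importing can save a failed pull. The sketch below mirrors the recognition rule as stated here (metadata.json plus at least one .bin shard, or a .zip). It is not the model manager's actual detection code, looks_like_qairt_bundle is a hypothetical name, and it does not inspect the contents of a .zip.

```python
import zipfile
from pathlib import Path


def looks_like_qairt_bundle(path: Path) -> bool:
    """Rough pre-flight check for a local Qualcomm AI Engine Direct bundle."""
    if path.is_file():
        # A single file is only acceptable if it is a readable .zip archive.
        return path.suffix.lower() == ".zip" and zipfile.is_zipfile(path)
    if path.is_dir():
        has_metadata = (path / "metadata.json").is_file()
        has_shards = any(p.is_file() for p in path.glob("*.bin"))
        return has_metadata and has_shards
    # Missing paths are never bundles.
    return False
```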

Run a local GGUF model

A local GGUF model is a directory (or file) holding your .gguf weights — side-loaded, produced by another tool, or already on disk. Import it once, then run it like any other model.

CLI

Point --local-path at the directory containing your .gguf file:

geniex pull local/my-model --local-path C:\models\my-model
geniex infer local/my-model

Python

Pass the path to a local .gguf file straight to from_pretrained:

from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    r"C:\models\my-model\model.gguf",   # path to a local .gguf file
    device_map="auto",                  # auto -> hybrid for llama_cpp
)

See the Python API reference for all parameters.

Android

Set hub = HubSource.LOCALFS and point local_path at a directory containing the .gguf file(s). A geniex.json manifest is used if present; otherwise the layout is inferred from the file names. For a VLM, drop the mmproj-*.gguf into the same directory.

ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "local/qwen3-0.6b",
        hub        = HubSource.LOCALFS,
        local_path = "/data/local/tmp/qwen3-0.6b",   // dir containing *.gguf
    )
).collect { /* Progress / Completed / Error */ }

Then load with runtime_id = "llama_cpp" — identical to the downloaded-model flow:

val paths = ModelManagerWrapper.getPaths("local/qwen3-0.6b")
    ?: error("Model not imported")

val llm = LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name   = paths.model_name,
            model_path   = paths.model_path,
            config       = ModelConfig(nCtx = 4096),
            runtime_id   = "llama_cpp",
            compute_unit = null,   // null → NPU on Snapdragon (recommended)
        )
    )
    .build()
    .getOrThrow()
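
Since the layout of a GGUF directory is inferred from file names when no geniex.json is present, it can help to list what the directory actually contains before pushing it to a device. A minimal sketch on the development machine, relying only on the mmproj-*.gguf naming mentioned above. The helper split_gguf_files is hypothetical and does not reproduce GenieX's own inference rules.

```python
from pathlib import Path
from typing import List, Tuple


def split_gguf_files(directory: Path) -> Tuple[List[Path], List[Path]]:
    """Return (weight files, mmproj files) found directly in `directory`."""
    weights: List[Path] = []
    projectors: List[Path] = []
    for path in sorted(directory.glob("*.gguf")):
        # Files named mmproj-*.gguf hold the vision projector for a VLM.
        if path.name.startswith("mmproj-"):
            projectors.append(path)
        else:
            weights.append(path)
    return weights, projectors
```

An empty first list means there is nothing to import; a non-empty second list suggests the directory holds a vision-language model.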

Precisions (Quantizations) Supported

Which precision you pick determines where the model runs.

llama.cpp

llama.cpp accepts the full range of GGML quantization formats. The CLI prompts you to pick one when you geniex pull a GGUF model:

Choose a precision version to download
> Q4_0       [1.2 GiB] (default)
  Q8_0       [2.0 GiB]
  F16        [3.8 GiB]

Precision (Quantization)	Runs on	Notes
`Q4_0` (default)	Hexagon NPU	Best NPU support in llama.cpp. Recommended for most models.
`Q8_0`	GPU / CPU	Better quality at ~2× the disk and memory cost.
`F16`	GPU / CPU	Reference precision. Mainly for evaluation — large and slow.
`Q4_K_M`, `Q5_K_M`, etc.	GPU / CPU	Mixed-precision K-quants. Not optimized for Hexagon NPU.

Stick with Q4_0 if you want the model to land on the Hexagon NPU. Other precisions will work but typically run on GPU or CPU.
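
The "~2×" cost of Q8_0 over Q4_0 follows from the bits each format spends per weight. The sketch below estimates file size from a parameter count; estimate_gib is a hypothetical helper, and the result is a lower bound, since real GGUF files also carry metadata and may keep some tensors at higher precision.

```python
# Effective bits per weight for each format, including per-block scales.
# Q4_0 and Q8_0 store one 16-bit scale per block of 32 weights.
BITS_PER_WEIGHT = {
    "Q4_0": 4.5,   # (32 * 4 + 16) / 32
    "Q8_0": 8.5,   # (32 * 8 + 16) / 32
    "F16": 16.0,
}


def estimate_gib(n_params: int, precision: str) -> float:
    """Approximate weight size in GiB for a model with n_params parameters."""
    bits = n_params * BITS_PER_WEIGHT[precision]
    return bits / 8 / 2**30
```

For any parameter count, the Q8_0 estimate is 8.5 / 4.5 (roughly 1.9) times the Q4_0 estimate.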

Qualcomm AI Engine Direct

Qualcomm AI Engine Direct bundles are statically pre-quantized — there is no precision choice at runtime. To use a different precision, get a different bundle from Qualcomm AI Hub.

Precision (Quantization)	Runs on	Notes
`w4a16` (most common)	Hexagon NPU	Weights int4, activations int16. Best balance of size and accuracy for on-device LLMs.
`w4`	Hexagon NPU	Weights int4, activations float. Slightly higher accuracy than w4a16 at the cost of more compute.

Most pre-built bundles on Qualcomm AI Hub use w4a16. The quantization level is baked into the bundle at compile time — pick the bundle that matches your quality/performance target.
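
The precision tags in the table follow a simple pattern: w plus the weight bit width, optionally followed by a plus the activation bit width. A small parser makes that explicit. This is a sketch covering only the two forms listed here; parse_precision and Precision are hypothetical names, and treating a missing a part as floating-point activations follows the w4 row of the table.

```python
import re
from typing import NamedTuple, Optional


class Precision(NamedTuple):
    weight_bits: int
    activation_bits: Optional[int]  # None means floating-point activations


_TAG = re.compile(r"w(\d+)(?:a(\d+))?")


def parse_precision(tag: str) -> Precision:
    """Split a tag such as 'w4a16' or 'w4' into its bit widths."""
    match = _TAG.fullmatch(tag.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized precision tag: {tag!r}")
    weight, activation = match.groups()
    return Precision(int(weight), int(activation) if activation else None)
```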

For other bundle constraints (context length, KV cache, Android nCtx), see Qualcomm AI Engine Direct runtime constraints.
