# Models

> Where to find models, how to run them, and which precisions land on the Snapdragon NPU.

There are two places to get models, matching GenieX's two [runtimes](/en/get-started/platforms#geniex-runtimes):

* **[Qualcomm AI Hub](https://aihub.qualcomm.com/models/)**: pre-compiled bundles for Qualcomm AI Engine Direct, plus curated GGUF models for llama.cpp.
* **[Hugging Face](https://huggingface.co/models?library=gguf)**: any GGUF model for llama.cpp. See [Run a GGUF model from Hugging Face](#run-a-gguf-model-from-hugging-face).

## **Run a Qualcomm AI Hub Model**

For a model in the `ai-hub-models/*` namespace, just `geniex infer` it:

```powershell windows theme={"dark"}
geniex infer ai-hub-models/Qwen3-4B
```

Qualcomm AI Engine Direct is NPU-only. Pass `--compute npu` or omit the flag.

## **Run a GGUF model from Hugging Face**

Any compatible GGUF repo on Hugging Face works. Copy the repo path and pass it to `geniex infer`:

```powershell windows theme={"dark"}
geniex infer /-GGUF
```

For example:

```powershell windows theme={"dark"}
geniex infer unsloth/Phi-4-mini-instruct-GGUF
```

When prompted:

* **Model type**: `vlm` for vision-language models (Qwen3-VL family), `llm` otherwise.
* **Precision (Quantization)**: `Q4_0` for best Hexagon NPU support. See [Precisions (Quantizations) Supported](#precisions-quantizations-supported).

### Set up a Hugging Face token

Some models on Hugging Face are **gated**: you must accept the model's license on the Hugging Face website and provide an access token before GenieX can download them.

Go to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and create a new token with **Read** access.

Visit the model's page on Hugging Face (e.g. `https://huggingface.co//`) and accept the license/agreement if prompted.
GenieX checks the following sources in order (first non-empty value wins):

1. `GENIEX_HFTOKEN` environment variable
2. `HF_TOKEN` environment variable
3. `~/.cache/huggingface/token` file

**Option A, environment variable (recommended):**

```powershell windows theme={"dark"}
# Standard HF variable (works with other HF tools too)
$env:HF_TOKEN = "hf_..."

# Or GenieX-specific (takes priority over HF_TOKEN)
$env:GENIEX_HFTOKEN = "hf_..."
```

```bash linux theme={"dark"}
# Standard HF variable (works with other HF tools too)
export HF_TOKEN="hf_..."

# Or GenieX-specific (takes priority over HF_TOKEN)
export GENIEX_HFTOKEN="hf_..."
```

**Option B, Hugging Face CLI login (persists to disk):**

```bash theme={"dark"}
pip install huggingface_hub
huggingface-cli login
```

This writes the token to `~/.cache/huggingface/token`, which GenieX reads automatically.

`GENIEX_HFTOKEN` takes highest priority, then `HF_TOKEN`, then the cached token file. Use `GENIEX_HFTOKEN` if you need a separate token for GenieX without affecting other Hugging Face tools.

## **Run a local Qualcomm AI Engine Direct bundle**

Bundles outside the `ai-hub-models/*` namespace, whether self-converted from Hugging Face or already sitting on disk, are imported once, then run like any other model.

Register the bundle with `geniex pull --local-path`, then `geniex infer` it.
**Self-converted from Hugging Face**

Make sure the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) is installed, then download a bundle and pull it:

```powershell windows theme={"dark"}
# Set model name (must match the HF subfolder name exactly)
$model = "llama_v3_2_3b_instruct"

# Download the model from Hugging Face
hf download yichqian/geniex-qairt-models `
  --local-dir $model `
  --include "$model/**"

# Pull using the downloaded folder (absolute path)
geniex pull local/$model --local-path (Resolve-Path "$model\$model").Path

# Run inference
geniex infer local/$model
```

```bash linux theme={"dark"}
# Set model name (must match the HF subfolder name exactly)
model="llama_v3_2_3b_instruct"

# Download the model from Hugging Face
hf download yichqian/geniex-qairt-models \
  --local-dir "$model" \
  --include "$model/**"

# Pull using the downloaded folder (absolute path)
geniex pull "local/$model" --local-path "$(realpath "$model/$model")"

# Run inference
geniex infer "local/$model"
```

**Already on disk**

Point `--local-path` at the extracted bundle directory (containing `.bin` shards and `metadata.json`):

```powershell windows theme={"dark"}
geniex pull local/my-bundle --local-path C:\models\my-bundle
geniex infer local/my-bundle
```

```bash linux theme={"dark"}
geniex pull local/my-bundle --local-path /home/user/models/my-bundle
geniex infer local/my-bundle
```

You can also pull directly from an AI Hub `.zip` archive without extracting first:

```powershell windows theme={"dark"}
geniex pull local/my-bundle --local-path C:\downloads\model.zip
```

```bash linux theme={"dark"}
geniex pull local/my-bundle --local-path /home/user/downloads/model.zip
```

`geniex pull` copies model files into its local cache. After a successful pull, you can safely delete the original download to reclaim disk space. Use `geniex list` to confirm the model is cached.
In Python, point `from_pretrained` at the bundle directory and force the Qualcomm AI Engine Direct runtime with `device_map="qairt"`:

```python theme={"dark"}
from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    r"C:\Qwen3-4B-Instruct-2507",         # any .bin file or the bundle folder
    model_name="qwen3_4b_instruct_2507",  # matches the Qualcomm AI Hub model id
    device_map="qairt",                   # required for Qualcomm AI Engine Direct
)
```

See the [Python API reference](/en/run/python/api-reference) for all parameters.

In Kotlin, set `hub = HubSource.LOCALFS` and point `local_path` at an extracted AI Hub directory (`metadata.json` plus one or more `.bin` shards) or the AI Hub `.zip`. `pullFlow` imports it into the SDK cache (no network); no `chipset` is required because the bundle is already compiled for one. The modality (LLM vs VLM) is read from `metadata.json`.

```kotlin theme={"dark"}
ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "local/qwen3-4b-2507",
        hub = HubSource.LOCALFS,
        local_path = "/data/local/tmp/Qwen3-4B-Instruct-2507", // extracted dir or .zip
    )
).collect { /* Progress / Completed / Error */ }
```

Then load with `runtime_id = "qairt"`, exactly as in the downloaded-model flow:

```kotlin theme={"dark"}
val paths = ModelManagerWrapper.getPaths("local/qwen3-4b-2507")
    ?: error("Model not imported")

val llm = LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = paths.model_name,
            model_path = paths.model_path,
            config = ModelConfig(max_tokens = 2048, enable_thinking = false),
            runtime_id = "qairt",
            compute_unit = null, // qairt is NPU-only
        )
    )
    .build()
    .getOrThrow()
```

The `model_name` is just the cache key: any `org/repo`-style string works, and a `local/...` prefix is a common convention.

You don't author a `geniex.json` for a local import; the model manager generates one during import.

A QAIRT folder is recognized by `metadata.json` + `.bin` shards (or a `.zip`); a directory of loose `.bin` files with no `metadata.json` won't be detected.
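Before importing, it can help to check that a path matches the recognition rule described above. The following is a minimal sketch of that rule only, not GenieX's own detection code; `looks_like_qairt_bundle` is a hypothetical helper name, and a `.zip` is accepted on its extension alone without inspecting its contents.

```python
from pathlib import Path


def looks_like_qairt_bundle(path):
    """Heuristic pre-check for the layout described above (hypothetical helper)."""
    p = Path(path)
    if p.is_file():
        # Assumption: any .zip is treated as a candidate archive; contents are not checked.
        return p.suffix.lower() == ".zip"
    if not p.is_dir():
        return False
    has_metadata = (p / "metadata.json").is_file()
    has_shards = any(p.glob("*.bin"))  # at least one .bin shard at the top level
    return has_metadata and has_shards
```

A directory holding only loose `.bin` files returns `False` here, which mirrors the case the note above warns about.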
When importing models of mixed types, prefer `paths.runtime_id` (authoritative) over hard-coding `runtime_id`.

## **Run a local GGUF model**

A local GGUF model is a directory (or file) holding your `.gguf` weights, whether side-loaded, produced by another tool, or already on disk. Import it once, then run it like any other model.

From the CLI, point `--local-path` at the directory containing your `.gguf` file:

```powershell windows theme={"dark"}
geniex pull local/my-model --local-path C:\models\my-model
geniex infer local/my-model
```

```bash linux theme={"dark"}
geniex pull local/my-model --local-path /home/user/models/my-model
geniex infer local/my-model
```

In Python, pass the path to a local `.gguf` file straight to `from_pretrained`:

```python theme={"dark"}
from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    r"C:\models\my-model\model.gguf",  # path to a local .gguf file
    device_map="auto",                 # auto -> hybrid for llama_cpp
)
```

See the [Python API reference](/en/run/python/api-reference) for all parameters.

In Kotlin, set `hub = HubSource.LOCALFS` and point `local_path` at a directory containing the `.gguf` file(s). A `geniex.json` manifest is used if present; otherwise the layout is inferred from the file names. For a VLM, drop the `mmproj-*.gguf` into the same directory.
```kotlin theme={"dark"}
ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "local/qwen3-0.6b",
        hub = HubSource.LOCALFS,
        local_path = "/data/local/tmp/qwen3-0.6b", // dir containing *.gguf
    )
).collect { /* Progress / Completed / Error */ }
```

Then load with `runtime_id = "llama_cpp"`, exactly as in the downloaded-model flow:

```kotlin theme={"dark"}
val paths = ModelManagerWrapper.getPaths("local/qwen3-0.6b")
    ?: error("Model not imported")

val llm = LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = paths.model_name,
            model_path = paths.model_path,
            config = ModelConfig(nCtx = 4096),
            runtime_id = "llama_cpp",
            compute_unit = null, // null → NPU on Snapdragon (recommended)
        )
    )
    .build()
    .getOrThrow()
```

## **Precisions (Quantizations) Supported**

Which precision you pick determines where the model runs.

### llama.cpp

llama.cpp accepts the full range of GGML quantization formats. The CLI prompts you to pick one when you `geniex pull` a GGUF model:

```powershell theme={"dark"}
Choose a precision version to download
> Q4_0 [1.2 GiB] (default)
  Q8_0 [2.0 GiB]
  F16 [3.8 GiB]
```

| Precision (Quantization) | Runs on         | Notes                                                       |
| ------------------------ | --------------- | ----------------------------------------------------------- |
| **`Q4_0`** *(default)*   | **Hexagon NPU** | Best NPU support in llama.cpp. Recommended for most models. |
| `Q8_0`                   | GPU / CPU       | Better quality at \~2× the disk and memory cost.            |
| `F16`                    | GPU / CPU       | Reference precision. Mainly for evaluation; large and slow. |
| `Q4_K_M`, `Q5_K_M`, etc. | GPU / CPU       | Mixed-precision K-quants. Not optimized for Hexagon NPU.    |

Stick with `Q4_0` if you want the model to land on the Hexagon NPU. Other precisions will work but typically run on GPU or CPU.

### Qualcomm AI Engine Direct

Qualcomm AI Engine Direct bundles are **statically pre-quantized**, so there is no precision choice at runtime.
To use a different precision, get a different bundle from [Qualcomm AI Hub](https://aihub.qualcomm.com/models/).

| Precision (Quantization)    | Runs on         | Notes                                                                                             |
| --------------------------- | --------------- | ------------------------------------------------------------------------------------------------- |
| **`w4a16`** *(most common)* | **Hexagon NPU** | Weights int4, activations int16. Best balance of size and accuracy for on-device LLMs.            |
| `w4`                        | **Hexagon NPU** | Weights int4, activations float. Slightly higher accuracy than w4a16 at the cost of more compute. |

Most pre-built bundles on Qualcomm AI Hub use `w4a16`. The quantization level is baked into the bundle at compile time; pick the bundle that matches your quality/performance target.

For other bundle constraints (context length, KV cache, Android `nCtx`), see [Qualcomm AI Engine Direct runtime constraints](/en/get-started/platforms#runtime-constraints).
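As a quick reference, the two precision tables in this section can be condensed into a small lookup. This is an illustrative sketch only: `PRECISION_TARGETS` and `expected_target` are hypothetical names that are not part of GenieX, the values are transcribed from the tables, and the runtime decides the actual placement on a given device.

```python
# Values transcribed from the precision tables on this page (not queried from GenieX).
PRECISION_TARGETS = {
    "llama_cpp": {"Q4_0": "Hexagon NPU", "Q8_0": "GPU / CPU", "F16": "GPU / CPU"},
    "qairt": {"w4a16": "Hexagon NPU", "w4": "Hexagon NPU"},
}


def expected_target(runtime_id, precision):
    """Return where a precision typically runs, or None if this page does not cover it."""
    table = PRECISION_TARGETS[runtime_id]
    if precision in table:
        return table[precision]
    if runtime_id == "llama_cpp":
        # Other GGML formats (K-quants and so on) typically run on GPU or CPU.
        return "GPU / CPU"
    return None  # QAIRT precision not listed on this page


print(expected_target("llama_cpp", "Q4_0"))
print(expected_target("llama_cpp", "Q5_K_M"))
```

Keeping the fallback explicit for `llama_cpp` reflects the guidance that only `Q4_0` is expected to land on the Hexagon NPU.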
