Quickstart - Qualcomm® AI Hub GenieX

Prerequisites

The CLI installed — see Install.
Interactive shell from container (Docker only) — see Run interactively.
Familiarity with runtime choice — qairt (Qualcomm AI Engine Direct) for Qualcomm AI Hub Models, llama_cpp for any GGUF.

Run your first model

Qualcomm AI Engine Direct runtime (Qualcomm AI Hub)

Language model:

windows

geniex infer ai-hub-models/Qwen3-4B

Multimodal model:

windows

geniex infer ai-hub-models/Qwen2.5-VL-7B-Instruct

llama.cpp runtime (GGUF)

Pick Q4_0 when prompted — it has the best Hexagon NPU support. Language model:

windows

geniex infer unsloth/Qwen3.5-0.8B-GGUF

Multimodal model:

windows

geniex infer Qwen/Qwen3-VL-2B-Instruct-GGUF

When prompted:

Model type — vlm for vision-language models, llm for text-only models. For Qwen3.5 and Gemma4, pick llm for now (multimodal support coming soon).
Precision (Quantization) — Q4_0 for best Hexagon NPU performance.

To try other GGUF models, copy any compatible GGUF path from Hugging Face and substitute it into the command above. See Run a GGUF model from Hugging Face.

Run a local model

Already have a model on disk, or want to self-convert a bundle from Hugging Face? Use geniex pull with --local-path to register it, then run it like any other model. See:

Run a local Qualcomm AI Engine Direct bundle — self-converted from Hugging Face, an extracted bundle directory, or an AI Hub .zip.
Run a local GGUF model — a directory containing your .gguf file.

Next steps

Local server

Expose an OpenAI-compatible HTTP API on localhost:18181.

CLI reference

Every command, every flag.

Was this page helpful?

Yes

​Prerequisites

​Run your first model

​Qualcomm AI Engine Direct runtime (Qualcomm AI Hub)

​llama.cpp runtime (GGUF)

​Run a local model

​Next steps

Local server

CLI reference

Prerequisites

Run your first model

Qualcomm AI Engine Direct runtime (Qualcomm AI Hub)

llama.cpp runtime (GGUF)

Run a local model

Next steps