Skip to main content

Prerequisites

  • The CLI installed — see Install.
  • Interactive shell from container (Docker only) — see Run interactively.
  • Familiarity with runtime choiceqairt (Qualcomm AI Engine Direct) for Qualcomm AI Hub Models, llama_cpp for any GGUF.

Run your first model

Qualcomm AI Engine Direct runtime (Qualcomm AI Hub)

Language model:
windows
geniex infer ai-hub-models/Qwen3-4B
Multimodal model:
windows
geniex infer ai-hub-models/Qwen2.5-VL-7B-Instruct

llama.cpp runtime (GGUF)

Pick Q4_0 when prompted — it has the best Hexagon NPU support. Language model:
windows
geniex infer unsloth/Qwen3.5-0.8B-GGUF
Multimodal model:
windows
geniex infer Qwen/Qwen3-VL-2B-Instruct-GGUF
When prompted:
  • Model typevlm for vision-language models, llm for text-only models. For Qwen3.5 and Gemma4, pick llm for now (multimodal support coming soon).
  • Precision (Quantization)Q4_0 for best Hexagon NPU performance.
To try other GGUF models, copy any compatible GGUF path from Hugging Face and substitute it into the command above. See Run a GGUF model from Hugging Face.

Run a local model

Already have a model on disk, or want to self-convert a bundle from Hugging Face? Use geniex pull with --local-path to register it, then run it like any other model. See:

Next steps

Local server

Expose an OpenAI-compatible HTTP API on localhost:18181.

CLI reference

Every command, every flag.