
Prerequisites

  • The Python SDK installed — see Install.
  • Familiarity with the runtime choice: qairt for Qualcomm AI Hub Models, llama_cpp for any GGUF.
The SDK follows the same design as Hugging Face transformers — load with AutoModelForCausalLM.from_pretrained(), then call .generate().
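The runtime choice from the prerequisites can be sketched as a small helper. This is an illustrative heuristic, not part of the SDK: `pick_device_map` is a hypothetical name, the `ai-hub-models/` prefix and the "gguf" naming check are inferred from the model ids used on this page, and passing `"llama_cpp"` as `device_map` assumes the `"<runtime>"` form listed in the GGUF example's comments.

```python
def pick_device_map(model_id: str) -> str:
    """Guess a device_map value from a model id (hypothetical helper, not an SDK call)."""
    lowered = model_id.lower()
    # Assumption: Qualcomm AI Hub bundles use the "ai-hub-models/" prefix seen on this page.
    if lowered.startswith("ai-hub-models/"):
        return "qairt"
    # Assumption: GGUF repos and files carry "gguf" in their final path segment.
    if "gguf" in lowered.rsplit("/", 1)[-1]:
        return "llama_cpp"
    # Anything else: defer to the SDK's own selection.
    return "auto"
```

The result would be passed as `device_map=pick_device_map(model_id)` to `from_pretrained()`; when in doubt, `"auto"` leaves the decision to the SDK.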

LLM inference (GGUF)

Any GGUF model from Hugging Face runs via llama_cpp. Model weights are downloaded on first use.
from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B-GGUF",     # HF repo id of a GGUF model, or a local .gguf path
    device_map="auto",          # "auto" | "cpu" | "gpu" | "npu" | "hybrid"
                                # | "<runtime>" | "<runtime>:<compute-unit>"
                                # auto -> hybrid for llama_cpp, npu for qairt
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = model.tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
)

# One-shot
output = model.generate(prompt, max_new_tokens=256)
print(output.text)
print(f"[{output.profile.generated_tokens} tok, "
      f"{output.profile.decode_speed:.1f} tok/s, stop={output.profile.stop_reason}]")

# Streaming
streamer = model.generate(prompt, max_new_tokens=256, stream=True)
for chunk in streamer:
    print(chunk, end="", flush=True)

model.close()
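
If generation raises, the `model.close()` call above is never reached. One option is to wrap the model in the standard library's `contextlib.closing`. The sketch below assumes only what the example shows: that `generate(..., stream=True)` yields printable chunks and that the model has a `close()` method. `collect_stream` and `FakeModel` are hypothetical names; `FakeModel` is a stand-in so the snippet runs without the SDK or any hardware.

```python
from contextlib import closing


def collect_stream(model, prompt, max_new_tokens=256):
    """Consume a streaming generation and return the full text as one string."""
    chunks = model.generate(prompt, max_new_tokens=max_new_tokens, stream=True)
    return "".join(str(chunk) for chunk in chunks)


class FakeModel:
    """Stand-in for a loaded model (hypothetical), used only to make this sketch runnable."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, max_new_tokens=256, stream=False):
        # Yield a fixed reply piece by piece, mimicking a streamer.
        yield from ["2 + 2", " = ", "4"]

    def close(self):
        self.closed = True


model = FakeModel()
# closing() calls model.close() on exit, even if generation raises.
with closing(model):
    text = collect_stream(model, "What is 2+2?")
print(text)
```

With the real SDK, replace `FakeModel()` with the `AutoModelForCausalLM.from_pretrained(...)` call shown above.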

LLM inference (QAIRT)

Pre-compiled bundles from Qualcomm AI Hub run entirely on the Hexagon NPU via the qairt runtime. Use device_map="qairt" (or "npu"). Model weights are downloaded on first use.
from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ai-hub-models/Qwen3-4B",   # Qualcomm AI Hub model id
    device_map="qairt",         # NPU-only
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = model.tokenizer.apply_chat_template(
    messages, add_generation_prompt=True,
)

# One-shot
output = model.generate(prompt, max_new_tokens=256)
print(output.text)
print(f"[{output.profile.generated_tokens} tok, "
      f"{output.profile.decode_speed:.1f} tok/s, stop={output.profile.stop_reason}]")

# Streaming
streamer = model.generate(prompt, max_new_tokens=256, stream=True)
for chunk in streamer:
    print(chunk, end="", flush=True)

model.close()

VLM inference (QAIRT)

Download a sample image first:
curl -o demo.jpg https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-geniex/demo.jpg
Then run inference:
import os
from geniex import AutoModelForCausalLM

image_path = os.path.abspath("demo.jpg")

model = AutoModelForCausalLM.from_pretrained(
    "ai-hub-models/Qwen2.5-VL-7B-Instruct",  # Qualcomm AI Hub VLM bundle
    device_map="qairt",
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": "Describe the image."},
    ],
}]
prompt = model.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

streamer = model.generate(prompt, images=[image_path], max_new_tokens=256, stream=True)
for chunk in streamer:
    print(chunk, end="", flush=True)

model.close()
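
The message layout used above can be wrapped in a small helper that also checks that the image exists before the model is loaded. `build_image_messages` is a hypothetical name, not an SDK function; it only reproduces the dictionary structure from the example.

```python
import os


def build_image_messages(image_path, question):
    """Return (messages, absolute_path) in the layout used by the VLM example."""
    abs_path = os.path.abspath(image_path)
    # Fail early with a clear error instead of at generation time.
    if not os.path.isfile(abs_path):
        raise FileNotFoundError(f"Image not found: {abs_path}")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": abs_path},
            {"type": "text", "text": question},
        ],
    }]
    return messages, abs_path
```

The returned `messages` go to `apply_chat_template`, and the returned path can be passed as `images=[abs_path]` to `model.generate`.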

Jupyter notebook walkthrough

For laptop users, follow the step-by-step Jupyter notebook at examples/python/windows.ipynb — it covers environment setup and inference end-to-end.

Next steps

API reference

All classes, methods, and parameters for the Python SDK.

Models

Supported models, GGUF on Hugging Face, and self-converted Qualcomm AI Engine Direct bundles.