> ## Documentation Index
> Fetch the complete documentation index at: https://geniex.aihub.qualcomm.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Quickstart

> Run your first model from the GenieX Python SDK on Windows ARM64.

## **Prerequisites**

* The Python SDK installed; see [Install](/en/run/python/install).
* Familiarity with [runtime choice](/en/get-started/platforms#geniex-runtimes): `qairt` for Qualcomm AI Hub Models, `llama_cpp` for any GGUF.

The SDK follows the same design as Hugging Face `transformers`: load with `AutoModelForCausalLM.from_pretrained()`, then call `.generate()`.

## **LLM inference (GGUF)**

Any GGUF model from Hugging Face runs via `llama_cpp`. Model weights are downloaded on first use.

```python theme={"dark"}
from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B-GGUF",  # HF repo id of a GGUF model, or a local .gguf path
    device_map="auto",       # "auto" | "cpu" | "gpu" | "npu" | "hybrid"
                             # | "" | ":"
                             # auto -> hybrid for llama_cpp, npu for qairt
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = model.tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# One-shot
output = model.generate(prompt, max_new_tokens=256)
print(output.text)
print(f"[{output.profile.generated_tokens} tok, "
      f"{output.profile.decode_speed:.1f} tok/s, stop={output.profile.stop_reason}]")

# Streaming
streamer = model.generate(prompt, max_new_tokens=256, stream=True)
for chunk in streamer:
    print(chunk, end="", flush=True)

model.close()
```

## **LLM inference (QAIRT)**

Pre-compiled bundles from [Qualcomm AI Hub](https://aihub.qualcomm.com/models/) run entirely on the Hexagon NPU via the `qairt` runtime. Use `device_map="qairt"` (or `"npu"`). Model weights are downloaded on first use.

```python theme={"dark"}
from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ai-hub-models/Qwen3-4B",  # Qualcomm AI Hub model id
    device_map="qairt",        # NPU-only
)

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = model.tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

# One-shot
output = model.generate(prompt, max_new_tokens=256)
print(output.text)
print(f"[{output.profile.generated_tokens} tok, "
      f"{output.profile.decode_speed:.1f} tok/s, stop={output.profile.stop_reason}]")

# Streaming
streamer = model.generate(prompt, max_new_tokens=256, stream=True)
for chunk in streamer:
    print(chunk, end="", flush=True)

model.close()
```

## **VLM inference (QAIRT)**

Download a sample image first:

```bash theme={"dark"}
curl -o demo.jpg https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-geniex/demo.jpg
```

Then run inference:

```python theme={"dark"}
import os

from geniex import AutoModelForCausalLM

image_path = os.path.abspath("demo.jpg")

model = AutoModelForCausalLM.from_pretrained(
    "ai-hub-models/Qwen2.5-VL-7B-Instruct",  # Qualcomm AI Hub VLM bundle
    device_map="qairt",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": "Describe the image."},
    ],
}]
prompt = model.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

streamer = model.generate(prompt, images=[image_path], max_new_tokens=256, stream=True)
for chunk in streamer:
    print(chunk, end="", flush=True)

model.close()
```

## **Jupyter notebook walkthrough**

For laptop users, follow the step-by-step Jupyter notebook at [examples/python/windows.ipynb](https://github.com/qualcomm/GenieX/blob/main/examples/python/windows.ipynb). It covers environment setup and inference end-to-end.

## **Next steps**

* All classes, methods, and parameters for the Python SDK.
* Supported models, GGUF on Hugging Face, and self-converted Qualcomm AI Engine Direct bundles.
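
The examples on this page call `model.close()` as their last statement, so an exception during generation would skip the cleanup. The following is a minimal sketch of one way to guard against that with the standard library's `contextlib.closing`, which works with any object that exposes `.close()`. `FakeModel` and `stream_to_text` are hypothetical names used only so the sketch runs without the SDK, and the sketch assumes streamed chunks are plain strings, as the `print(chunk, ...)` loops above suggest.

```python
from contextlib import closing


class FakeModel:
    """Hypothetical stand-in for a loaded model; not part of the GenieX SDK."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, max_new_tokens=256, stream=False):
        # Mimics the streaming call shape used on this page: yields text chunks.
        chunks = ["2", " + ", "2", " = ", "4"]
        return iter(chunks[:max_new_tokens])

    def close(self):
        self.closed = True


def stream_to_text(model, prompt, max_new_tokens=256):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    for chunk in model.generate(prompt, max_new_tokens=max_new_tokens, stream=True):
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


model = FakeModel()
# closing() calls model.close() when the block exits, even if generation raises.
with closing(model) as m:
    text = stream_to_text(m, "What is 2+2?")
```

With the real SDK, the object returned by `AutoModelForCausalLM.from_pretrained(...)` would take the place of `FakeModel()`, provided its `close()` method behaves as the examples above imply.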
