API reference - Qualcomm® AI Hub GenieX

The GenieX Python SDK follows the same design patterns as Hugging Face transformers — use AutoModel*.from_pretrained() to load a model, then call .generate() for inference.

AutoModelForCausalLM

Factory for loading causal language models — both text-only and multimodal. Returns a GenieXLLM for text-only models, or a GenieXVLM when a multimodal model is detected (e.g. phi4_multimodal, qwen3.5-vl, gemma4).

from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ai-hub-models/Qwen3-4B-Instruct",
    device_map="auto",
)

`from_pretrained()`

Parameter	Type	Default	Description
`model_name_or_path`	`str`	required	Hugging Face repo ID, short alias (e.g. `"qwen3"`), or local path.
`device_map`	`str`	`"auto"`	`"auto"` picks the first available runtime and compute unit. Also accepts `"<runtime>"` (runtime only) or `"<runtime>:<compute_unit>"` (runtime and compute unit).
`precision`	`str \| None`	`None`	Precision (quantization) variant (e.g. `"Q4_K_M"`). Filters files when downloading from Hub.
`mmproj_path`	`str \| None`	`None`	Path to the multimodal projector file. Auto-resolved from Hub; pass explicitly to force VLM mode.

Additional parameters

Parameter	Type	Default	Description
`model_name`	`str \| None`	`None`	Override the registry model name (e.g. `"granite4"` for Qualcomm AI Engine Direct).
`hf_token`	`str \| None`	`None`	Hugging Face bearer token for gated models.
`max_tokens`	`int`	—	Maximum tokens for generation.

Returns: GenieXLLM or GenieXVLM (auto-detected based on the model)
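
The two explicit `device_map` forms described above can be illustrated with a small parser. This is a minimal sketch, not the SDK's own logic: `split_device_map` is a hypothetical helper, and `"npu"` is an assumed compute-unit name used only for illustration.

```python
from typing import Optional, Tuple


def split_device_map(device_map: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a device_map string into (runtime, compute_unit).

    Hypothetical helper for illustration; not part of the GenieX SDK.
    "auto" yields (None, None), meaning "let the SDK choose".
    """
    if device_map == "auto":
        return None, None
    runtime, sep, compute_unit = device_map.partition(":")
    # Reject an empty runtime ("", ":npu") or a dangling colon ("qairt:").
    if not runtime or (sep and not compute_unit):
        raise ValueError(f"malformed device_map: {device_map!r}")
    return runtime, (compute_unit or None)


print(split_device_map("auto"))
print(split_device_map("qairt"))
print(split_device_map("qairt:npu"))  # "npu" is an assumed compute-unit name
```

Valid runtime and compute-unit identifiers for a given machine are reported by `geniex.get_runtime_list()` and `geniex.get_compute_unit_list(runtime)`.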

AutoModelForVision2Seq

Factory for loading vision-language / multimodal models. Returns a GenieXVLM instance.

from geniex import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("ai-hub-models/Qwen2.5-7B-Instruct", device_map="qairt")

`from_pretrained()`

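Accepts the same parameters as AutoModelForCausalLM.from_pretrained(). The multimodal projector parameter is listed again here because it is the one most relevant to vision-language models: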

Parameter	Type	Default	Description
`mmproj_path`	`str \| None`	`None`	Path to the multimodal projector file. Auto-resolved when downloading from Hub.

Returns: GenieXVLM

GenieXLLM

Text-only language model instance returned by AutoModelForCausalLM.from_pretrained().

`generate()`

Run text generation from a formatted prompt string.

output = model.generate(prompt, max_new_tokens=256)
print(output.text)

Parameter	Type	Default	Description
`prompt`	`str`	required	The formatted prompt string (use `model.tokenizer.apply_chat_template()` to build it).
`max_new_tokens`	`int`	`512`	Maximum number of tokens to generate.
`temperature`	`float`	`0.7`	Sampling temperature.
`top_p`	`float`	`0.9`	Nucleus sampling threshold.
`top_k`	`int`	`40`	Top-k sampling.
`min_p`	`float`	`0.0`	Minimum probability threshold.
`stream`	`bool`	`False`	If `True`, returns a `TextIteratorStreamer` instead of a `GenerateOutput`.

Penalty and advanced parameters

Parameter	Type	Default	Description
`repetition_penalty`	`float`	`1.1`	Repetition penalty.
`presence_penalty`	`float`	`0.0`	Presence penalty.
`frequency_penalty`	`float`	`0.0`	Frequency penalty.
`seed`	`int`	`-1`	Random seed (`-1` = random).
`stop`	`list[str] \| None`	`None`	Stop sequences.
`grammar`	`str \| None`	`None`	GBNF grammar string for constrained generation.
`json_mode`	`bool`	`False`	Force JSON output.

Returns: GenerateOutput (or TextIteratorStreamer when stream=True)
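
Because `generate()` expects an already formatted prompt, the two calls are often wrapped together. The sketch below shows one way to do that; `ask` is a hypothetical convenience function, not part of the SDK, and it assumes `stream` is left at its default so that a `GenerateOutput` is returned.

```python
from typing import Any


def ask(model: Any, question: str, **generate_kwargs: Any) -> str:
    """Format a single user turn with the chat template, then generate.

    `model` is expected to be an object returned by from_pretrained().
    Hypothetical helper for illustration; not part of the GenieX SDK.
    """
    messages = [{"role": "user", "content": question}]
    prompt = model.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Extra keyword arguments (max_new_tokens, seed, ...) pass straight through.
    output = model.generate(prompt, **generate_kwargs)
    return output.text


# Usage with a loaded model (not executed here):
#   from geniex import AutoModelForCausalLM
#   with AutoModelForCausalLM.from_pretrained("qwen3") as model:
#       print(ask(model, "What is 2+2?", max_new_tokens=64, seed=42))
```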

`reset()`

Resets conversation state and clears the KV cache.

`save_kv_cache(path)` / `load_kv_cache(path)`

Save the key-value (KV) cache to a file, or load it from one. `path` is a `str`.
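
One possible use of these two methods is to reuse a cache across runs. The sketch below is built on assumptions the reference does not confirm: that calling `generate()` on a priming prompt populates the cache, and that a saved cache is only loaded back into the same model. `restore_or_prime` is a hypothetical helper.

```python
import os
from typing import Any


def restore_or_prime(model: Any, cache_path: str, priming_prompt: str) -> bool:
    """Load a saved KV cache if one exists; otherwise prime and save one.

    Returns True when the cache was loaded from disk, False when it was built.
    Hypothetical helper for illustration; not part of the GenieX SDK.
    """
    if os.path.exists(cache_path):
        model.load_kv_cache(cache_path)
        return True
    # Assumption: running generate() on the priming prompt fills the KV cache.
    model.generate(priming_prompt, max_new_tokens=1)
    model.save_kv_cache(cache_path)
    return False
```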

`close()`

Releases the model handle and frees resources. Also supports context-manager usage:

with AutoModelForCausalLM.from_pretrained("qwen3") as model:
    output = model.generate(prompt)
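
Where a `with` block is inconvenient, the same cleanup can be written with `try`/`finally`. This is a minimal sketch; `generate_once` and its `load_model` argument are hypothetical names, and the loader is passed in as a callable so the function does not depend on a particular model.

```python
from typing import Any, Callable


def generate_once(load_model: Callable[[], Any], prompt: str, **generate_kwargs: Any) -> str:
    """Load a model, run one generation, and always call close().

    Hypothetical helper for illustration; not part of the GenieX SDK.
    """
    model = load_model()
    try:
        return model.generate(prompt, **generate_kwargs).text
    finally:
        model.close()  # runs even if generate() raises


# Usage (not executed here):
#   from geniex import AutoModelForCausalLM
#   text = generate_once(lambda: AutoModelForCausalLM.from_pretrained("qwen3"), prompt)
```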

GenieXVLM

Vision-language model instance returned by AutoModelForVision2Seq.from_pretrained(), or by AutoModelForCausalLM.from_pretrained() when a multimodal model is detected.

`generate()`

Same parameters as GenieXLLM.generate() plus:

Parameter	Type	Default	Description
`images`	`list[str] \| None`	`None`	List of image file paths for the model to process.
`audios`	`list[str] \| None`	`None`	List of audio file paths for the model to process.

output = model.generate(prompt, images=["/path/to/image.jpg"], max_new_tokens=256)
print(output.text)

Returns: GenerateOutput (or TextIteratorStreamer when stream=True)
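
Since `images` takes file paths, it can help to check them before starting inference. The following is a minimal sketch under that assumption; `generate_with_images` is a hypothetical wrapper, and how the SDK itself reports a missing file is not covered by this reference.

```python
import os
from typing import Any, Sequence


def generate_with_images(
    model: Any, prompt: str, image_paths: Sequence[str], **generate_kwargs: Any
) -> str:
    """Check that every image path exists, then call generate(images=...).

    Hypothetical helper for illustration; not part of the GenieX SDK.
    """
    missing = [p for p in image_paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"image file(s) not found: {missing}")
    output = model.generate(prompt, images=list(image_paths), **generate_kwargs)
    return output.text
```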

`reset()` / `close()`

Same as GenieXLLM.

ModelTokenizer

Accessed via model.tokenizer. Provides a transformers-compatible chat template interface.

`apply_chat_template()`

Formats a list of chat messages using the model’s built-in chat template.

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = model.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

Parameter	Type	Default	Description
`messages`	`list[dict]`	required	List of message dicts with `"role"` and `"content"` keys.
`tokenize`	`bool`	`False`	Must be `False` — standalone tokenization is not supported.
`add_generation_prompt`	`bool`	`True`	Whether to append the generation prompt suffix.
`enable_thinking`	`bool \| None`	`None`	Controls thinking mode. `None` (default) and `True` both let a thinking-capable model think, and have no effect on non-thinking models. `False` asks a thinking-capable model to skip its thinking turn; on a non-thinking model it is forced to `True` with a warning, because the suppression block is out of distribution (OOD) for such models. Capability is auto-detected and exposed as `model.supports_thinking`.
`tools`	`list[dict] \| str \| None`	`None`	Tool definitions as a list of dicts or pre-serialised JSON string.

Returns: str — formatted prompt ready for model.generate().
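
The `enable_thinking` rules in the table can be restated as a small truth function. This sketch only models the documented behaviour; it is not the SDK's implementation, and `effective_enable_thinking` is a hypothetical name.

```python
import warnings
from typing import Optional


def effective_enable_thinking(requested: Optional[bool], supports_thinking: bool) -> bool:
    """Model the documented resolution of `enable_thinking` (illustrative only)."""
    if requested is False and not supports_thinking:
        # Documented behaviour: forced to True, with a warning.
        warnings.warn("enable_thinking=False is forced to True on non-thinking models")
        return True
    if requested is None:
        # None behaves like True.
        return True
    return requested


for requested in (None, True, False):
    print(requested, "->", effective_enable_thinking(requested, supports_thinking=True))
```

Only the combination `False` with a thinking-capable model changes the prompt; every other combination resolves to `True`.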

Output classes

GenerateOutput

Returned by model.generate().

Attribute	Type	Description
`text`	`str`	The generated text (thinking tags stripped if present).
`thinking`	`str \| None`	The model’s reasoning content, or `None`.
`profile`	`ProfileData`	Performance metrics.

ProfileData

Attribute	Type	Description
`ttft`	`int`	Time to first token (ms).
`prompt_tokens`	`int`	Number of prompt tokens.
`generated_tokens`	`int`	Number of generated tokens.
`prefill_speed`	`float`	Prefill speed (tokens/s).
`decode_speed`	`float`	Decode speed (tokens/s).
`stop_reason`	`str \| None`	Why generation stopped (e.g. `"eos"`, `"limit"`).

All timing fields

Attribute	Type	Description
`prompt_time`	`int`	Prompt processing time (ms).
`decode_time`	`int`	Decode time (ms).
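
The fields above are convenient to log as a single line. The sketch below formats them using only the attribute names listed in the tables; `format_profile` is a hypothetical helper.

```python
from typing import Any


def format_profile(profile: Any) -> str:
    """Render the documented ProfileData fields as a one-line summary.

    Hypothetical helper for illustration; not part of the GenieX SDK.
    """
    return (
        f"ttft={profile.ttft} ms | "
        f"prompt={profile.prompt_tokens} tok @ {profile.prefill_speed:.1f} tok/s | "
        f"generated={profile.generated_tokens} tok @ {profile.decode_speed:.1f} tok/s | "
        f"stop={profile.stop_reason}"
    )


# Usage (not executed here):
#   output = model.generate(prompt, max_new_tokens=256)
#   print(format_profile(output.profile))
```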

TextIteratorStreamer

Returned by model.generate(..., stream=True). Yields decoded text chunks as they are generated.

streamer = model.generate(prompt, max_new_tokens=256, stream=True)
for chunk in streamer:
    print(chunk, end="", flush=True)

final = streamer.output  # GenerateOutput available after iteration

Method / Property	Description
`__iter__()`	Yields `str` chunks as they are generated.
`output`	`GenerateOutput \| None` — available after iteration finishes.
`cancel()`	Stop generation at the next token boundary.
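
As an illustration of `cancel()`, the sketch below stops a stream once a character budget is reached. It assumes the streamer simply stops yielding after cancellation takes effect, so the loop keeps draining chunks rather than breaking out; `stream_with_char_budget` is a hypothetical helper.

```python
from typing import Any, List


def stream_with_char_budget(streamer: Any, max_chars: int) -> str:
    """Collect streamed chunks and request cancellation once max_chars is reached.

    Hypothetical helper for illustration; not part of the GenieX SDK.
    """
    parts: List[str] = []
    total = 0
    cancelled = False
    for chunk in streamer:
        parts.append(chunk)
        total += len(chunk)
        if total >= max_chars and not cancelled:
            streamer.cancel()  # takes effect at the next token boundary
            cancelled = True
    return "".join(parts)


# Usage (not executed here):
#   streamer = model.generate(prompt, max_new_tokens=256, stream=True)
#   text = stream_with_char_budget(streamer, max_chars=500)
```

Because cancellation happens at a token boundary, the returned text can be slightly longer than `max_chars`.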

Model manager

The same model manager the CLI uses is available programmatically via geniex.model_manager.

from geniex import model_manager as mm

MODEL = "Qwen/Qwen3-0.6B-GGUF"
mm.pull(MODEL)
print(f"pull complete: {MODEL}")
paths = mm.get_paths(MODEL)
print(f"model path:    {paths}")
local_models = mm.list_models()
print(f"local models:  {local_models}")
mm.remove(MODEL)
print(f"model removed: {MODEL}")

Function	Description
`pull(model_name, ...)`	Download a model by alias or `org/repo[:precision]`.
`list_models()`	Returns `list[str]` of cached model names.
`get_paths(model_name)`	Returns `ModelPaths` with resolved local file paths.
`get_type(model_name)`	Returns `"llm"` or `"vlm"`.
`resolve_alias(alias)`	Resolves a short alias to canonical `org/repo`.
`remove(model_name)`	Delete a cached model from disk.
`clean()`	Remove all cached models. Returns count removed.
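
The functions above can be combined into a "download only if missing" step. This is a minimal sketch; `ensure_model` is a hypothetical helper, and it assumes the names returned by `list_models()` match the name passed to `pull()`, which may not hold when a short alias is used.

```python
from typing import Any


def ensure_model(mm: Any, model_name: str) -> Any:
    """Pull model_name only if it is not already cached, then return its paths.

    `mm` is expected to be the geniex.model_manager module.
    Hypothetical helper for illustration; not part of the GenieX SDK.
    """
    # Assumption: cached names are comparable to the name given here.
    if model_name not in mm.list_models():
        mm.pull(model_name)
    return mm.get_paths(model_name)


# Usage (not executed here):
#   from geniex import model_manager as mm
#   paths = ensure_model(mm, "Qwen/Qwen3-0.6B-GGUF")
```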

pull() parameters

Parameter	Type	Default	Description
`model_name`	`str`	required	`org/repo` or a short alias.
`precision`	`str \| None`	`None`	Precision (quantization) hint (e.g. `"Q4_K_M"`).
`hub`	`str`	`"auto"`	`"auto"` \| `"hf"` \| `"localfs"`.
`local_path`	`str \| None`	`None`	Source directory (required when `hub="localfs"`).
`hf_token`	`str \| None`	`None`	Hugging Face bearer token.
`on_progress`	`Callable \| None`	`None`	Callback `(files: list[FileProgress]) -> bool`; return `False` to cancel.
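
The `on_progress` contract (return `False` to cancel) makes it possible to bound how long a download may run. The sketch below builds such a callback without touching any `FileProgress` fields, since they are not documented here; `make_deadline_callback` and its `clock` parameter are hypothetical.

```python
import time
from typing import Any, Callable, List


def make_deadline_callback(
    timeout_s: float, clock: Callable[[], float] = time.monotonic
) -> Callable[[List[Any]], bool]:
    """Build an on_progress callback that cancels a pull after timeout_s seconds.

    Hypothetical helper for illustration; not part of the GenieX SDK.
    """
    deadline = clock() + timeout_s

    def on_progress(files: List[Any]) -> bool:
        # Returning False asks pull() to cancel; True lets it continue.
        return clock() < deadline

    return on_progress


# Usage (not executed here):
#   from geniex import model_manager as mm
#   mm.pull("Qwen/Qwen3-0.6B-GGUF", on_progress=make_deadline_callback(600))
```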

ModelPaths attributes

Attribute	Type	Description
`model_path`	`str`	Path to the main model file.
`model_dir`	`str`	Directory containing the model.
`model_name`	`str`	Canonical model name.
`runtime`	`str`	Runtime identifier.
`mmproj_path`	`str \| None`	Multimodal projector path (VLM only).
`tokenizer_path`	`str \| None`	Tokenizer file path.
`compute_unit`	`str \| None`	Compute unit identifier.

SDK functions

Function	Description
`geniex.init()`	Initialize the SDK. Called automatically on first model load.
`geniex.deinit()`	Shut down the SDK and release resources.
`geniex.version()`	Returns the SDK version string.
`geniex.get_runtime_list()`	Returns `list[str]` of available runtime IDs.
`geniex.get_compute_unit_list(runtime)`	Returns `list[tuple[str, str]]` of `(compute_unit, compute_unit_name)` pairs for the given runtime.
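
The two discovery functions can be combined to list every `"<runtime>:<compute_unit>"` string that `device_map` should accept on the current machine. This is a minimal sketch; `available_device_maps` is a hypothetical helper that takes the `geniex` module as an argument.

```python
from typing import Any, List


def available_device_maps(sdk: Any) -> List[str]:
    """Return "<runtime>:<compute_unit>" strings for every reported combination.

    `sdk` is expected to be the imported geniex module.
    Hypothetical helper for illustration; not part of the GenieX SDK.
    """
    targets: List[str] = []
    for runtime in sdk.get_runtime_list():
        # Each entry is a (compute_unit, compute_unit_name) pair.
        for compute_unit, _display_name in sdk.get_compute_unit_list(runtime):
            targets.append(f"{runtime}:{compute_unit}")
    return targets


# Usage (not executed here):
#   import geniex
#   print(available_device_maps(geniex))
```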
