transformers — use AutoModel*.from_pretrained() to load a model, then call .generate() for inference.
AutoModelForCausalLM
Factory for loading causal language models — both text-only and multimodal. Returns aGenieXLLM for text-only models, or a GenieXVLM when a multimodal model is detected (e.g. phi4_multimodal, qwen3.5-vl, gemma4).
from_pretrained()
| Parameter | Type | Default | Description |
|---|---|---|---|
model_name_or_path | str | required | HuggingFace repo id, short alias (e.g. "qwen3"), or local path. |
device_map | str | "auto" | "auto" picks the first available runtime + compute unit. Also accepts "<runtime>" (runtime) or "<runtime>:<compute_unit>" (runtime + compute unit). |
precision | str | None | None | Precision (quantization) variant (e.g. "Q4_K_M"). Filters files when downloading from Hub. |
mmproj_path | str | None | None | Path to the multimodal projector file. Auto-resolved from Hub; pass explicitly to force VLM mode. |
GenieXLLM or GenieXVLM (auto-detected based on the model)
AutoModelForVision2Seq
Factory for loading vision-language / multimodal models. Returns aGenieXVLM instance.
from_pretrained()
Accepts all the same parameters as AutoModelForCausalLM.from_pretrained() plus:
| Parameter | Type | Default | Description |
|---|---|---|---|
mmproj_path | str | None | None | Path to the multimodal projector file. Auto-resolved when downloading from Hub. |
GenieXVLM
GenieXLLM
Text-only language model instance returned byAutoModelForCausalLM.from_pretrained().
generate()
Run text generation from a formatted prompt string.
| Parameter | Type | Default | Description |
|---|---|---|---|
prompt | str | required | The formatted prompt string (use model.tokenizer.apply_chat_template() to build it). |
max_new_tokens | int | 512 | Maximum number of tokens to generate. |
temperature | float | 0.7 | Sampling temperature. |
top_p | float | 0.9 | Nucleus sampling threshold. |
top_k | int | 40 | Top-k sampling. |
min_p | float | 0.0 | Minimum probability threshold. |
stream | bool | False | If True, returns a TextIteratorStreamer instead. |
GenerateOutput (or TextIteratorStreamer when stream=True)
reset()
Resets conversation state and clears the KV cache.
save_kv_cache(path) / load_kv_cache(path)
Save or load the key-value cache to/from a file path (str).
close()
Releases the model handle and frees resources. Also supports context-manager usage:
GenieXVLM
Vision-language model instance returned byAutoModelForVision2Seq.from_pretrained().
generate()
Same parameters as GenieXLLM.generate() plus:
| Parameter | Type | Default | Description |
|---|---|---|---|
images | list[str] | None | None | List of image file paths for the model to process. |
audios | list[str] | None | None | List of audio file paths for the model to process. |
GenerateOutput (or TextIteratorStreamer when stream=True)
reset() / close()
Same as GenieXLLM.
ModelTokenizer
Accessed viamodel.tokenizer. Provides a transformers-compatible chat template interface.
apply_chat_template()
Formats a list of chat messages using the model’s built-in chat template.
| Parameter | Type | Default | Description |
|---|---|---|---|
messages | list[dict] | required | List of message dicts with "role" and "content" keys. |
tokenize | bool | False | Must be False — standalone tokenization is not supported. |
add_generation_prompt | bool | True | Whether to append the generation prompt suffix. |
enable_thinking | bool | None | None | Enable thinking mode. None (default) and True both let a thinking-capable model think (and are no-ops on non-thinking models). False asks a thinking-capable model to skip its thinking turn — forced to True with a warning on non-thinking models, where the suppression block is OOD. Capability is auto-detected and exposed as model.supports_thinking. |
tools | list[dict] | str | None | None | Tool definitions as a list of dicts or pre-serialised JSON string. |
str — formatted prompt ready for model.generate().
Output classes
GenerateOutput
Returned bymodel.generate().
| Attribute | Type | Description |
|---|---|---|
text | str | The generated text (thinking tags stripped if present). |
thinking | str | None | The model’s reasoning content, or None. |
profile | ProfileData | Performance metrics. |
ProfileData
| Attribute | Type | Description |
|---|---|---|
ttft | int | Time to first token (ms). |
prompt_tokens | int | Number of prompt tokens. |
generated_tokens | int | Number of generated tokens. |
prefill_speed | float | Prefill speed (tokens/s). |
decode_speed | float | Decode speed (tokens/s). |
stop_reason | str | None | Why generation stopped (e.g. "eos", "limit"). |
TextIteratorStreamer
Returned bymodel.generate(..., stream=True). Yields decoded text chunks as they are generated.
| Method / Property | Description |
|---|---|
__iter__() | Yields str chunks as they are generated. |
output | GenerateOutput | None — available after iteration finishes. |
cancel() | Stop generation at the next token boundary. |
Model manager
The same model manager the CLI uses is available programmatically viageniex.model_manager.
| Function | Description |
|---|---|
pull(model_name, ...) | Download a model by alias or org/repo[:precision]. |
list_models() | Returns list[str] of cached model names. |
get_paths(model_name) | Returns ModelPaths with resolved local file paths. |
get_type(model_name) | Returns "llm" or "vlm". |
resolve_alias(alias) | Resolves a short alias to canonical org/repo. |
remove(model_name) | Delete a cached model from disk. |
clean() | Remove all cached models. Returns count removed. |
SDK functions
| Function | Description |
|---|---|
geniex.init() | Initialize the SDK. Called automatically on first model load. |
geniex.deinit() | Shut down the SDK and release resources. |
geniex.version() | Returns the SDK version string. |
geniex.get_runtime_list() | Returns list[str] of available runtime IDs. |
geniex.get_compute_unit_list(runtime) | Returns list[tuple[str, str]] of (compute_unit, compute_unit_name) pairs for the given runtime. |
Was this page helpful?