The GenieX Python SDK follows the same design patterns as Hugging Face transformers — use AutoModel*.from_pretrained() to load a model, then call .generate() for inference.

AutoModelForCausalLM

Factory for loading causal language models — both text-only and multimodal. Returns a GenieXLLM for text-only models, or a GenieXVLM when a multimodal model is detected (e.g. phi4_multimodal, qwen3.5-vl, gemma4).
from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ai-hub-models/Qwen3-4B-Instruct",
    device_map="auto",
)

from_pretrained()

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name_or_path | str | required | Hugging Face repo id, short alias (e.g. "qwen3"), or local path. |
| device_map | str | "auto" | "auto" picks the first available runtime and compute unit. Also accepts "<runtime>" (runtime only) or "<runtime>:<compute_unit>" (runtime and compute unit). |
| precision | str \| None | None | Precision (quantization) variant, e.g. "Q4_K_M". Filters files when downloading from the Hub. |
| mmproj_path | str \| None | None | Path to the multimodal projector file. Auto-resolved from the Hub; pass it explicitly to force VLM mode. |

Returns: GenieXLLM or GenieXVLM (auto-detected based on the model)
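
The three device_map forms listed above can be illustrated with a small parser. This is only a sketch of the documented string format, not the SDK's own parsing code; the helper name split_device_map and the compute-unit name "npu" are hypothetical.

```python
from typing import Optional, Tuple


def split_device_map(device_map: str = "auto") -> Tuple[Optional[str], Optional[str]]:
    """Split a device_map string into (runtime, compute_unit).

    Hypothetical helper mirroring the documented forms "auto", "<runtime>"
    and "<runtime>:<compute_unit>". None means "left for the SDK to choose".
    """
    if not device_map:
        raise ValueError("device_map must be a non-empty string")
    if device_map == "auto":
        return None, None
    runtime, sep, compute_unit = device_map.partition(":")
    # Reject forms such as ":npu" (no runtime) or "qairt:" (dangling colon).
    if not runtime or (sep and not compute_unit):
        raise ValueError(f"malformed device_map: {device_map!r}")
    return runtime, (compute_unit or None)


print(split_device_map("auto"))       # (None, None)
print(split_device_map("qairt"))      # ('qairt', None)
print(split_device_map("qairt:npu"))  # ('qairt', 'npu'); "npu" is a made-up unit name
```

Valid runtime and compute-unit IDs for a given machine come from geniex.get_runtime_list() and geniex.get_compute_unit_list(runtime).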

AutoModelForVision2Seq

Factory for loading vision-language / multimodal models. Returns a GenieXVLM instance.
from geniex import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("ai-hub-models/Qwen2.5-7B-Instruct", device_map="qairt")

from_pretrained()

Accepts all the same parameters as AutoModelForCausalLM.from_pretrained() plus:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| mmproj_path | str \| None | None | Path to the multimodal projector file. Auto-resolved when downloading from Hub. |

Returns: GenieXVLM

GenieXLLM

Text-only language model instance returned by AutoModelForCausalLM.from_pretrained().

generate()

Run text generation from a formatted prompt string.
output = model.generate(prompt, max_new_tokens=256)
print(output.text)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prompt | str | required | The formatted prompt string (use model.tokenizer.apply_chat_template() to build it). |
| max_new_tokens | int | 512 | Maximum number of tokens to generate. |
| temperature | float | 0.7 | Sampling temperature. |
| top_p | float | 0.9 | Nucleus sampling threshold. |
| top_k | int | 40 | Top-k sampling. |
| min_p | float | 0.0 | Minimum probability threshold. |
| stream | bool | False | If True, returns a TextIteratorStreamer instead. |

Returns: GenerateOutput (or TextIteratorStreamer when stream=True)
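
As a rough illustration of what the sampling parameters above typically do, the sketch below applies temperature, top-k, top-p and min-p filtering to a toy list of logits. The order of the steps and the exact definitions follow common conventions and are assumptions here; the GenieX runtimes may implement them differently. filter_probs is a hypothetical name, not an SDK function.

```python
import math
from typing import Dict, List


def filter_probs(
    logits: List[float],
    temperature: float = 0.7,
    top_k: int = 40,
    top_p: float = 0.9,
    min_p: float = 0.0,
) -> Dict[int, float]:
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter here).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Min-p: drop tokens less likely than min_p times the best token.
    floor = min_p * probs[kept[0]]
    kept = [i for i in kept if probs[i] >= floor]

    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


survivors = filter_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=2)
print(sorted(survivors))  # [0, 1]
```

Lower temperature sharpens the distribution before filtering, so fewer tokens tend to survive the top-p cut.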

reset()

Resets conversation state and clears the KV cache.

save_kv_cache(path) / load_kv_cache(path)

Save or load the key-value cache to/from a file path (str).

close()

Releases the model handle and frees resources. Also supports context-manager usage:
with AutoModelForCausalLM.from_pretrained("qwen3") as model:
    output = model.generate(prompt)

GenieXVLM

Vision-language model instance returned by AutoModelForVision2Seq.from_pretrained(), and by AutoModelForCausalLM.from_pretrained() when a multimodal model is detected.

generate()

Same parameters as GenieXLLM.generate() plus:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| images | list[str] \| None | None | List of image file paths for the model to process. |
| audios | list[str] \| None | None | List of audio file paths for the model to process. |

output = model.generate(prompt, images=["/path/to/image.jpg"], max_new_tokens=256)
print(output.text)
Returns: GenerateOutput (or TextIteratorStreamer when stream=True)
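
Because images and audios are passed as file paths, it can help to check them before calling generate(). The helper below is a hypothetical convenience, not part of the SDK; how GenieX itself reports a missing file is not covered here.

```python
from pathlib import Path
from typing import Iterable, List


def existing_paths(paths: Iterable[str]) -> List[str]:
    """Return the paths as strings, raising if any of them is not a file."""
    checked = []
    for raw in paths:
        path = Path(raw).expanduser()
        if not path.is_file():
            raise FileNotFoundError(f"media file not found: {raw}")
        checked.append(str(path))
    return checked


# Intended use (needs a loaded GenieXVLM and a formatted prompt):
# output = model.generate(
#     prompt,
#     images=existing_paths(["/path/to/image.jpg"]),
#     max_new_tokens=256,
# )
```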

reset() / close()

Same as GenieXLLM.

ModelTokenizer

Accessed via model.tokenizer. Provides a transformers-compatible chat template interface.

apply_chat_template()

Formats a list of chat messages using the model’s built-in chat template.
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = model.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| messages | list[dict] | required | List of message dicts with "role" and "content" keys. |
| tokenize | bool | False | Must be False; standalone tokenization is not supported. |
| add_generation_prompt | bool | True | Whether to append the generation prompt suffix. |
| enable_thinking | bool \| None | None | Controls thinking mode. None (default) and True both let a thinking-capable model think, and are no-ops on non-thinking models. False asks a thinking-capable model to skip its thinking turn; on non-thinking models it is forced to True with a warning, because the suppression block is out-of-distribution (OOD) for them. Capability is auto-detected and exposed as model.supports_thinking. |
| tools | list[dict] \| str \| None | None | Tool definitions as a list of dicts or a pre-serialised JSON string. |

Returns: str, the formatted prompt ready for model.generate().
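
To make the effect of add_generation_prompt concrete, the sketch below renders messages in ChatML, the layout used by Qwen chat models. It is an illustration only: render_chatml is a hypothetical function, and the real template ships with each model and may add system-prompt defaults, tool sections, or thinking blocks.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages as a ChatML prompt string (simplified)."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is 2+2?"}]
print(render_chatml(messages))
```

Without the generation prompt suffix, the string ends after the user turn and the model has no cue that an assistant reply should start.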

Output classes

GenerateOutput

Returned by model.generate().

| Attribute | Type | Description |
| --- | --- | --- |
| text | str | The generated text (thinking tags stripped if present). |
| thinking | str \| None | The model’s reasoning content, or None. |
| profile | ProfileData | Performance metrics. |
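
The split between text and thinking can be pictured with a short sketch. It assumes the reasoning is wrapped in <think>...</think> tags, as Qwen3-style models emit; the SDK's actual tag handling is not documented here, and split_thinking is a hypothetical name.

```python
import re
from typing import Optional, Tuple

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> Tuple[str, Optional[str]]:
    """Separate the first <think>...</think> block from the visible answer."""
    match = _THINK_RE.search(raw)
    if match is None:
        return raw.strip(), None
    thinking = match.group(1).strip()
    text = (raw[:match.start()] + raw[match.end():]).strip()
    # An empty thinking block is reported as None, like a missing one.
    return text, (thinking or None)


text, thinking = split_thinking("<think>2+2 is 4</think>\n\n4")
print(text)      # 4
print(thinking)  # 2+2 is 4
```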

ProfileData

| Attribute | Type | Description |
| --- | --- | --- |
| ttft | int | Time to first token (ms). |
| prompt_tokens | int | Number of prompt tokens. |
| generated_tokens | int | Number of generated tokens. |
| prefill_speed | float | Prefill speed (tokens/s). |
| decode_speed | float | Decode speed (tokens/s). |
| stop_reason | str \| None | Why generation stopped (e.g. "eos", "limit"). |
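
The fields above can be combined into derived figures. The sketch below uses a stand-in dataclass with the documented attribute names (it is not the SDK class) and assumes that decode_speed covers the tokens produced after the first one and that "limit" means max_new_tokens was reached; both are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileSnapshot:
    """Stand-in carrying the documented ProfileData fields."""
    ttft: int                 # ms
    prompt_tokens: int
    generated_tokens: int
    prefill_speed: float      # tokens/s
    decode_speed: float       # tokens/s
    stop_reason: Optional[str] = None


def estimate_total_ms(profile) -> float:
    """Rough wall-clock estimate: time to first token plus decode time."""
    if profile.decode_speed <= 0:
        return float(profile.ttft)
    remaining = max(profile.generated_tokens - 1, 0)
    return profile.ttft + 1000.0 * remaining / profile.decode_speed


def hit_token_limit(profile) -> bool:
    """True when generation appears to have stopped at the token limit."""
    return profile.stop_reason == "limit"


snapshot = ProfileSnapshot(
    ttft=120, prompt_tokens=32, generated_tokens=101,
    prefill_speed=266.7, decode_speed=25.0, stop_reason="limit",
)
print(estimate_total_ms(snapshot))  # 4120.0
print(hit_token_limit(snapshot))    # True
```

Both functions only read attributes, so they should also accept output.profile from a real generate() call.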

TextIteratorStreamer

Returned by model.generate(..., stream=True). Yields decoded text chunks as they are generated.
streamer = model.generate(prompt, max_new_tokens=256, stream=True)
for chunk in streamer:
    print(chunk, end="", flush=True)

final = streamer.output  # GenerateOutput available after iteration

| Method / Property | Description |
| --- | --- |
| __iter__() | Yields str chunks as they are generated. |
| output | GenerateOutput \| None; available after iteration finishes. |
| cancel() | Stop generation at the next token boundary. |
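
One way to use cancel() is to stop early once a marker string has appeared. The sketch below is a possible pattern under stated assumptions: FakeStreamer is a stand-in so the code runs without the SDK, stream_until is a hypothetical helper, and the loop keeps iterating after cancel() on the assumption that the streamer then finishes on its own and populates output.

```python
from typing import Iterable, Iterator, List, Optional


class FakeStreamer:
    """Stand-in for TextIteratorStreamer, for demonstration only."""

    def __init__(self, chunks: Iterable[str]) -> None:
        self._chunks = list(chunks)
        self.cancelled = False
        self.output: Optional[object] = None

    def __iter__(self) -> Iterator[str]:
        for chunk in self._chunks:
            if self.cancelled:
                break
            yield chunk
        self.output = "done"  # the real streamer exposes a GenerateOutput here

    def cancel(self) -> None:
        self.cancelled = True


def stream_until(streamer, stop: str, max_chars: int = 2000) -> str:
    """Collect chunks, cancelling once `stop` appears or max_chars is reached."""
    collected: List[str] = []
    size = 0
    cancelled = False
    for chunk in streamer:
        collected.append(chunk)
        size += len(chunk)
        if not cancelled and (stop in "".join(collected) or size >= max_chars):
            streamer.cancel()  # takes effect at the next token boundary
            cancelled = True
    return "".join(collected)


fake = FakeStreamer(["Hello", " world", ".\n", "More", " text"])
print(repr(stream_until(fake, stop="\n")))  # 'Hello world.\n'
```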

Model manager

The same model manager the CLI uses is available programmatically via geniex.model_manager.
from geniex import model_manager as mm

MODEL = "Qwen/Qwen3-0.6B-GGUF"
mm.pull(MODEL)
print(f"pull complete: {MODEL}")
paths = mm.get_paths(MODEL)
print(f"model path:    {paths}")
local_models = mm.list_models()
print(f"local models:  {local_models}")
mm.remove(MODEL)
print(f"model removed: {MODEL}")

| Function | Description |
| --- | --- |
| pull(model_name, ...) | Download a model by alias or org/repo[:precision]. |
| list_models() | Returns list[str] of cached model names. |
| get_paths(model_name) | Returns ModelPaths with resolved local file paths. |
| get_type(model_name) | Returns "llm" or "vlm". |
| resolve_alias(alias) | Resolves a short alias to canonical org/repo. |
| remove(model_name) | Delete a cached model from disk. |
| clean() | Remove all cached models. Returns count removed. |
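
A common pattern with these functions is to download a model only when it is not cached yet. The sketch below assumes that list_models() reports names in the same form that was passed to pull(), which is not confirmed here; ensure_model is a hypothetical helper and FakeManager is an in-memory stand-in so the example runs without the SDK.

```python
from typing import List


def ensure_model(manager, model_name: str):
    """Pull model_name only if it is not cached, then return its paths."""
    if model_name not in manager.list_models():
        manager.pull(model_name)
    return manager.get_paths(model_name)


class FakeManager:
    """In-memory stand-in for geniex.model_manager, for demonstration only."""

    def __init__(self) -> None:
        self.cached: List[str] = []
        self.pull_calls = 0

    def list_models(self) -> List[str]:
        return list(self.cached)

    def pull(self, model_name: str) -> None:
        self.pull_calls += 1
        self.cached.append(model_name)

    def get_paths(self, model_name: str) -> str:
        return f"/fake/cache/{model_name}"  # the real call returns ModelPaths


manager = FakeManager()
ensure_model(manager, "Qwen/Qwen3-0.6B-GGUF")
ensure_model(manager, "Qwen/Qwen3-0.6B-GGUF")
print(manager.pull_calls)  # 1
```

With the SDK installed, the module itself would be passed in: ensure_model(mm, MODEL).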

SDK functions

| Function | Description |
| --- | --- |
| geniex.init() | Initialize the SDK. Called automatically on first model load. |
| geniex.deinit() | Shut down the SDK and release resources. |
| geniex.version() | Returns the SDK version string. |
| geniex.get_runtime_list() | Returns list[str] of available runtime IDs. |
| geniex.get_compute_unit_list(runtime) | Returns list[tuple[str, str]] of (compute_unit, compute_unit_name) pairs for the given runtime. |
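
The two discovery functions can be combined to list every "<runtime>:<compute_unit>" string that device_map might accept. This is a sketch that assumes the first element of each pair is the ID device_map expects; list_device_maps is a hypothetical helper, and the runtime and unit names in the stand-in data are invented.

```python
from typing import Callable, List, Sequence, Tuple


def list_device_maps(
    get_runtime_list: Callable[[], Sequence[str]],
    get_compute_unit_list: Callable[[str], Sequence[Tuple[str, str]]],
) -> List[str]:
    """Build "<runtime>:<compute_unit>" strings from the two discovery calls."""
    device_maps = []
    for runtime in get_runtime_list():
        for compute_unit, _display_name in get_compute_unit_list(runtime):
            device_maps.append(f"{runtime}:{compute_unit}")
    return device_maps


# Stand-in data so the sketch runs without the SDK; real IDs will differ.
fake_units = {
    "runtime_a": [("cpu", "CPU")],
    "runtime_b": [("cpu", "CPU"), ("npu", "NPU")],
}
print(list_device_maps(lambda: list(fake_units), lambda r: fake_units[r]))

# With the SDK installed:
# list_device_maps(geniex.get_runtime_list, geniex.get_compute_unit_list)
```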