Model inference
geniex pull
Download a model and store it locally.
geniex pull <model-name>[:<precision>]
| Flag | Description |
|---|---|
--model-hub | Model source: aihub, hf, or localfs. Auto-detected when omitted. |
--local-path | Path to a local directory or AI Hub .zip file. Implies --model-hub localfs. |
--model-type | Model type: llm or vlm. Auto-detected when omitted. |
Pulling from a local path:
geniex pull local/my-model --local-path /path/to/model-dir
geniex pull copies the files into the GenieX cache, so after a successful pull you can safely delete the source to avoid keeping two copies.
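The pull-then-delete workflow can be sketched as a small shell function. This is a minimal sketch, not part of GenieX: pull_and_clean is a hypothetical helper name, and it assumes geniex pull exits with a non-zero status when the pull fails.

```shell
# Pull a model from a local directory, then delete the source only if the pull succeeded.
# pull_and_clean is a hypothetical helper, not a GenieX command.
# Assumption: geniex pull returns a non-zero exit status on failure.
pull_and_clean() {
    model_name=${1:?model name required}    # e.g. local/my-model
    source_dir=${2:?source path required}   # e.g. /path/to/model-dir

    if geniex pull "$model_name" --local-path "$source_dir"; then
        # The files now live in the GenieX cache, so the source copy is redundant.
        rm -rf -- "$source_dir"
    else
        echo "pull failed; keeping $source_dir" >&2
        return 1
    fi
}

# Usage: pull_and_clean local/my-model /path/to/model-dir
```

Deleting only on success means a failed or interrupted pull never costs you the only copy of the model.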
Precision (Quantization) (llama.cpp only)
For GGUF models the CLI prompts you to pick a precision:
Choose a precision version to download
> Q4_0 [1.2 GiB] (default)
  Q8_0 [2.0 GiB]
  F16 [3.8 GiB]
Qualcomm AI Hub Models are pre-quantized, so there is no precision to choose.
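The optional :<precision> suffix in the pull syntax can name one of the listed precisions directly. The sketch below only builds and prints such a command. model_ref is a hypothetical helper, the model name is borrowed from another example on this page, and whether the suffix skips the interactive prompt is an assumption rather than something this page states.

```shell
# Build a model reference of the form <model-name>[:<precision>].
# model_ref is a hypothetical helper, not a GenieX command.
model_ref() {
    name=$1
    precision=${2:-}
    if [ -n "$precision" ]; then
        printf '%s:%s\n' "$name" "$precision"
    else
        printf '%s\n' "$name"
    fi
}

# Print the command rather than running it, so the sketch works without GenieX installed.
# The model name is illustrative; Q8_0 is one of the precisions shown in the prompt.
echo "geniex pull $(model_ref unsloth/Qwen3-0.6B-GGUF Q8_0)"
```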
geniex infer — LLM
Launch an interactive chat session with a language model.
geniex infer ai-hub-models/Qwen3-4B
Thinking mode — control whether the model shows reasoning before responding:
geniex infer ai-hub-models/Qwen3-4B --think # show reasoning steps
geniex infer ai-hub-models/Qwen3-4B --think=false # respond directly
Compute unit selection (via --compute) — pick which compute unit runs the model (default: npu):
# llama.cpp models support all compute units
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute npu
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute gpu
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute cpu
# Qualcomm AI Hub Models only support NPU
geniex infer ai-hub-models/Qwen3-4B --compute npu
Qualcomm AI Hub Models run on NPU only. Using --compute cpu or --compute gpu returns an error.
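A wrapper script can catch an unsupported combination before launching. The following is a rough sketch under one loud assumption: it treats any model name starting with ai-hub-models/ as a Qualcomm AI Hub Model, which is only a heuristic based on the examples here. check_compute is a hypothetical helper, not a GenieX command.

```shell
# Validate a --compute choice before launching geniex infer.
# Assumption: AI Hub models are recognised by the ai-hub-models/ prefix (a heuristic).
check_compute() {
    model=$1
    unit=${2:-npu}   # npu is the documented default

    case $unit in
        npu|gpu|cpu) ;;
        *) echo "unknown compute unit: $unit" >&2; return 2 ;;
    esac

    case $model in
        ai-hub-models/*)
            if [ "$unit" != npu ]; then
                echo "$model runs on NPU only" >&2
                return 1
            fi ;;
    esac
    return 0
}

# Usage: check_compute "$model" "$unit" && geniex infer "$model" --compute "$unit"
```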
geniex infer — VLM
Run vision-language inference with text-only or image input:
geniex infer ai-hub-models/Qwen2.5-VL-7B-Instruct-GGUF
For text-only input, launch the session and chat. For image input, include the image's absolute path in your prompt, or drag the file into your terminal:
Describe this picture </full/path/to/image.png>
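Because the prompt expects an absolute path, it can help to resolve a relative one before pasting it. A minimal sketch using only standard POSIX utilities follows; abs_path is a hypothetical helper, not a GenieX command.

```shell
# Turn a possibly relative image path into an absolute one for pasting into the chat prompt.
# abs_path is a hypothetical helper, not a GenieX command.
abs_path() {
    file=$1
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    # Resolve the containing directory, then re-attach the file name.
    dir=$(cd "$(dirname -- "$file")" && pwd -P) || return 1
    case $dir in /) dir= ;; esac   # avoid a double slash for files in the root directory
    printf '%s/%s\n' "$dir" "$(basename -- "$file")"
}

# Usage: abs_path ./image.png
```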
geniex serve
Start the OpenAI-compatible local server. See Local server for the API.
Configuration flags
These flags can be passed to geniex infer to control model loading and generation.
Sampler flags
Control how the model selects tokens during generation.
| Flag | Type | Default | Description |
|---|---|---|---|
--temperature | float | — | Sampling temperature. Higher values increase randomness. |
--top-p | float | — | Top-p (nucleus) sampling threshold. |
--top-k | int | — | Top-k sampling. Only consider the top-k most likely tokens. |
--min-p | float | — | Min-p sampling threshold. |
--repetition-penalty | float | 1 | Penalize repeated tokens. Values > 1 reduce repetition. |
--presence-penalty | float | — | Penalize tokens that have appeared at all. |
--frequency-penalty | float | — | Penalize tokens proportional to their frequency. |
--seed | int | — | Random seed for reproducible outputs. |
--grammar-path | string | — | Path to a GBNF grammar file for constrained generation. |
--grammar-string | string | — | Inline grammar in GBNF string format. |
--enable-json | — | — | Force JSON-only output. |
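As an illustration of constrained generation, the sketch below writes a one-rule GBNF grammar to a file and prints a geniex infer command that references it. The model name, the grammar file location, and the values for --temperature and --seed are illustrative choices. It also assumes the grammar flags apply to llama.cpp (GGUF) models, which this page does not state.

```shell
# Write a small GBNF grammar that restricts the reply to "yes" or "no".
grammar_file=${TMPDIR:-/tmp}/yes-no.gbnf
cat > "$grammar_file" <<'EOF'
root ::= "yes" | "no"
EOF

# Print the command instead of running it, so the sketch works without GenieX installed.
# Assumption: grammar flags are honoured for GGUF models; all values are illustrative.
echo "geniex infer unsloth/Qwen3.5-0.8B-GGUF --grammar-path $grammar_file --temperature 0.2 --seed 42"
```

Fixing --seed alongside a low --temperature is a common way to make runs easier to compare, though exact reproducibility can still depend on the backend.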
Model flags
Control model loading, context, and generation limits.
| Flag | Type | Default | Description |
|---|---|---|---|
-n, --ngl | int | 999 | Number of layers to offload to GPU. |
--nctx | int | 4096 | Context window size (max input + output tokens). |
--max-tokens | int | 2048 | Maximum tokens to generate per response. |
--stop | string[] | — | Stop sequences (can be specified multiple times). |
--stop-file | string | — | File containing stop sequences (one per line). |
--think / --think=false | bool | true | Enable or disable thinking mode for reasoning models. |
-s, --system-prompt | string | — | System prompt to set model behavior. |
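Since --nctx covers input and output together, --max-tokens effectively reserves part of the window for the reply. The arithmetic can be sketched as follows, assuming generated tokens count against the same window. input_budget is a hypothetical helper, not a GenieX command.

```shell
# Tokens left for the prompt and chat history once a full-length reply is reserved.
# Assumption: generated tokens count against the same --nctx window.
input_budget() {
    nctx=${1:-4096}         # documented default for --nctx
    max_tokens=${2:-2048}   # documented default for --max-tokens
    if [ "$max_tokens" -ge "$nctx" ]; then
        echo "--max-tokens ($max_tokens) leaves no room for input in --nctx ($nctx)" >&2
        return 1
    fi
    echo $((nctx - max_tokens))
}

input_budget             # defaults: prints 2048
input_budget 8192 1024   # prints 7168
```

With the defaults, roughly half of the context window remains for the system prompt, the conversation history, and the current message.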
Utility commands
| Command | Description | Example |
|---|---|---|
geniex list | Display all downloaded models with their names and sizes. | geniex list |
geniex remove <model> | Remove a specific local model by name. | geniex remove unsloth/Qwen3-0.6B-GGUF |
geniex clean | Delete all locally cached models. | geniex clean |
geniex infer -h | Show help for geniex infer. | geniex infer -h |
geniex serve -h | Show help for geniex serve. | geniex serve -h |
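To remove a few specific models without wiping the whole cache, geniex remove can be called in a loop. This is a sketch under stated assumptions: remove_models is a hypothetical helper, and it assumes geniex remove runs without prompting and returns a non-zero exit status on failure.

```shell
# Remove several cached models by name, reporting any that fail.
# remove_models is a hypothetical helper built on geniex remove.
# Assumption: geniex remove is non-interactive and exits non-zero on failure.
remove_models() {
    failed=0
    for model in "$@"; do
        if ! geniex remove "$model"; then
            echo "could not remove $model" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every removal succeeded.
    [ "$failed" -eq 0 ]
}

# Usage: remove_models unsloth/Qwen3-0.6B-GGUF local/my-model
```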