Model inference

geniex pull

Download a model and store it locally.
geniex pull <model-name>[:<precision>]
| Flag | Description |
| --- | --- |
| --model-hub | Model source: aihub, hf, or localfs. Auto-detected when omitted. |
| --local-path | Path to a local directory or AI Hub .zip file. Implies --model-hub localfs. |
| --model-type | Model type: llm or vlm. Auto-detected when omitted. |
Pulling from a local path:
geniex pull local/my-model --local-path /path/to/model-dir
pull copies files into the GenieX cache. After a successful pull you can safely delete the source to avoid keeping two copies.
Precision (Quantization) (llama.cpp only)
For GGUF models, the CLI prompts you to pick a precision:
Choose a precision version to download
> Q4_0       [1.2 GiB] (default)
  Q8_0       [2.0 GiB]
  F16        [3.8 GiB]
Q4_0 has the best Hexagon NPU support. See Precisions (Quantizations) Supported.
Qualcomm AI Hub Models are pre-quantized, so there is no precision to choose.
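The usage line above allows an optional `:<precision>` suffix on the model name. The following is a minimal sketch of building that reference in a script. The helper name `model_ref` is hypothetical, and whether a suffix skips the interactive prompt is an assumption that the text above does not confirm. The sketch prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: build "<model-name>[:<precision>]" for geniex pull.
# $1 = model name, $2 = optional precision tag (for example Q4_0).
model_ref() {
  if [ -n "${2:-}" ]; then
    printf '%s:%s\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}

# Print the command for review instead of executing it.
echo geniex pull "$(model_ref unsloth/Qwen3-0.6B-GGUF Q4_0)"
```

To run the pull for real, drop the leading `echo`. For an AI Hub model, call `model_ref` with the name only, since those models are already quantized.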

geniex infer — LLM

Launch an interactive chat session with a language model.
geniex infer ai-hub-models/Qwen3-4B
Thinking mode — control whether the model shows reasoning before responding:
geniex infer ai-hub-models/Qwen3-4B --think         # show reasoning steps
geniex infer ai-hub-models/Qwen3-4B --think=false   # respond directly
Compute unit selection (via --compute) — pick which compute unit runs the model (default: npu):
# llama.cpp models support all compute units
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute npu
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute gpu
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute cpu

# Qualcomm AI Hub Models only support NPU
geniex infer ai-hub-models/Qwen3-4B --compute npu
Qualcomm AI Hub Models run on NPU only. Using --compute cpu or --compute gpu returns an error.
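The NPU-only rule can be checked in a wrapper script before a session is launched. The sketch below is one way to do that, not part of the CLI. The function name `check_compute` is hypothetical, and it assumes a model is an AI Hub model when its name starts with `ai-hub-models/`, as in the examples above. The CLI's own detection may differ.

```shell
#!/bin/sh
# Hypothetical guard: reject compute units that AI Hub models do not support.
# ASSUMPTION: names starting with "ai-hub-models/" are AI Hub models.
check_compute() {
  model=$1
  compute=${2:-npu}   # npu is the documented default
  case "$compute" in
    npu|gpu|cpu) ;;
    *) echo "unknown compute unit: $compute" >&2; return 2 ;;
  esac
  case "$model" in
    ai-hub-models/*)
      if [ "$compute" != "npu" ]; then
        echo "$model runs on NPU only" >&2
        return 1
      fi
      ;;
  esac
  # Print the command for review instead of executing it.
  printf 'geniex infer %s --compute %s\n' "$model" "$compute"
}

check_compute unsloth/Qwen3.5-0.8B-GGUF gpu
check_compute ai-hub-models/Qwen3-4B
```

The function returns a non-zero status for a rejected combination, so a calling script can stop before starting a session that would fail.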

geniex infer — VLM

Run vision-language inference with text-only or image input:
geniex infer ai-hub-models/Qwen2.5-VL-7B-Instruct-GGUF
For text-only use, launch the session and chat. For image input, include the absolute path to the image in your prompt, or drag the file into your terminal:
Describe this picture </full/path/to/image.png>

geniex serve

Start the OpenAI-compatible local server. See Local server for the API.
geniex serve

Configuration flags

These flags can be passed to geniex infer to control model loading and generation.

Sampler flags

Control how the model selects tokens during generation.
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --temperature | float | | Sampling temperature. Higher values increase randomness. |
| --top-p | float | | Top-p (nucleus) sampling threshold. |
| --top-k | int | | Top-k sampling. Only consider the top-k most likely tokens. |
| --min-p | float | | Min-p sampling threshold. |
| --repetition-penalty | float | 1 | Penalize repeated tokens. Values > 1 reduce repetition. |
| --presence-penalty | float | | Penalize tokens that have appeared at all. |
| --frequency-penalty | float | | Penalize tokens proportional to their frequency. |
| --seed | int | | Random seed for reproducible outputs. |
| --grammar-path | string | | Path to a GBNF grammar file for constrained generation. |
| --grammar-string | string | | Inline grammar in GBNF string format. |
| --enable-json | | | Force JSON-only output. |
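As an illustration of combining sampler flags, the sketch below writes a one-rule GBNF grammar and assembles a `geniex infer` command line that uses it with a fixed seed. The numeric values, the file name `yes-no.gbnf`, and the helper `sampler_cmd` are placeholders chosen for this example. Whether grammar constraints apply to every backend is not stated in the table, so treat that as an assumption.

```shell
#!/bin/sh
# Write a one-rule GBNF grammar that only admits "yes" or "no".
grammar_file="${TMPDIR:-/tmp}/yes-no.gbnf"
printf '%s\n' 'root ::= "yes" | "no"' > "$grammar_file"

# Print a geniex infer command line using sampler flags from the table.
# The numeric values are illustrative placeholders, not recommended settings.
sampler_cmd() {
  model=$1
  printf 'geniex infer %s --temperature 0.2 --top-p 0.9 --seed 42 --grammar-path %s\n' \
    "$model" "$grammar_file"
}

sampler_cmd unsloth/Qwen3-0.6B-GGUF
```

The script only prints the command. Copy the printed line into a terminal to start the session with those settings.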

Model flags

Control model loading, context, and generation limits.
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| -n, --ngl | int | 999 | Number of layers to offload to GPU. |
| --nctx | int | 4096 | Context window size (max input + output tokens). |
| --max-tokens | int | 2048 | Maximum tokens to generate per response. |
| --stop | string[] | | Stop sequences (can be specified multiple times). |
| --stop-file | string | | File containing stop sequences (one per line). |
| --think / --think=false | bool | true | Enable or disable thinking mode for reasoning models. |
| -s, --system-prompt | string | | System prompt to set model behavior. |
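Because the context window covers input and output tokens together, a response limit at or above `--nctx` leaves no room for the prompt. The sketch below checks that relationship before printing a command. The helper name `infer_cmd` is hypothetical, and how the CLI itself handles such a combination is not documented here.

```shell
#!/bin/sh
# Hypothetical helper: print a geniex infer command with context limits,
# refusing combinations where the response limit would fill the whole window.
# $1 = model, $2 = context size (default 4096), $3 = max tokens (default 2048).
infer_cmd() {
  model=$1
  nctx=${2:-4096}
  max_tokens=${3:-2048}
  if [ "$max_tokens" -ge "$nctx" ]; then
    echo "--max-tokens ($max_tokens) must be smaller than --nctx ($nctx)" >&2
    return 1
  fi
  printf 'geniex infer %s --nctx %s --max-tokens %s\n' "$model" "$nctx" "$max_tokens"
}

infer_cmd ai-hub-models/Qwen3-4B 8192 1024
```

With the documented defaults (4096 and 2048), half the window remains available for the prompt and conversation history.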

Utility commands

| Command | Description | Example |
| --- | --- | --- |
| geniex list | Display all downloaded models with their names and sizes. | geniex list |
| geniex remove <model> | Remove a specific local model by name. | geniex remove unsloth/Qwen3-0.6B-GGUF |
| geniex clean | Delete all locally cached models. | geniex clean |
| geniex infer -h | Show help for geniex infer. | geniex infer -h |
| geniex serve -h | Show help for geniex serve. | geniex serve -h |