> ## Documentation Index
> Fetch the complete documentation index at: https://geniex.aihub.qualcomm.com/llms.txt
> Use this file to discover all available pages before exploring further.

# CLI reference

> Every GenieX CLI command and flag, with usage examples.

## **Model inference**

### **`geniex pull`**

Download a model and store it locally.

```powershell theme={"dark"}
geniex pull [:]
```

| Flag | Description |
| --- | --- |
| `--model-hub` | Model source: `aihub` \| `hf` \| `localfs`. Auto-detected when omitted. |
| `--local-path` | Path to a local directory or AI Hub `.zip` file. Implies `--model-hub localfs`. |
| `--model-type` | Model type: `llm` \| `vlm`. Auto-detected when omitted. |

**Pulling from a local path:**

```powershell theme={"dark"}
geniex pull local/my-model --local-path /path/to/model-dir
```

`pull` copies files into the GenieX cache. After a successful pull you can safely delete the source to avoid keeping two copies.

**Precision (Quantization) (llama.cpp only)**

For GGUF models the CLI prompts you to pick a precision:

```powershell theme={"dark"}
Choose a precision version to download
> Q4_0 [1.2 GiB] (default)
  Q8_0 [2.0 GiB]
  F16 [3.8 GiB]
```

`Q4_0` has the best Hexagon NPU support. See [Precisions (Quantizations) Supported](/en/models/supported#precisions-quantizations-supported).

Qualcomm AI Hub Models are pre-quantized; no choice needed.

### **`geniex infer` — LLM**

Launch an interactive chat session with a language model.

```powershell theme={"dark"}
geniex infer ai-hub-models/Qwen3-4B
```

**Thinking mode**: control whether the model shows reasoning before responding:

```powershell theme={"dark"}
geniex infer ai-hub-models/Qwen3-4B --think        # show reasoning steps
geniex infer ai-hub-models/Qwen3-4B --think=false  # respond directly
```

**Compute unit selection** (via `--compute`): pick which compute unit runs the model (default: `npu`):

```powershell theme={"dark"}
# llama.cpp models support all compute units
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute npu
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute gpu
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute cpu

# Qualcomm AI Hub Models only support NPU
geniex infer ai-hub-models/Qwen3-4B --compute npu
```

Qualcomm AI Hub Models run on NPU only. Using `--compute cpu` or `--compute gpu` returns an error.

### **`geniex infer` — VLM**

Run vision-language inference with text-only or image input:

```bash theme={"dark"}
geniex infer ai-hub-models/Qwen2.5-VL-7B-Instruct-GGUF
```

For text-only, just launch and chat. For image input, provide the **absolute path** or drag the file into your terminal:

```bash theme={"dark"}
Describe this picture
```

## **`geniex serve`**

Start the OpenAI-compatible local server. See [Local server](/en/run/cli/local-server) for the API.

```bash theme={"dark"}
geniex serve
```

## **Configuration flags**

These flags can be passed to `geniex infer` to control model loading and generation.

### Sampler flags

Control how the model selects tokens during generation.

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `--temperature` | float | — | Sampling temperature. Higher values increase randomness. |
| `--top-p` | float | — | Top-p (nucleus) sampling threshold. |
| `--top-k` | int | — | Top-k sampling. Only consider the top-k most likely tokens. |
| `--min-p` | float | — | Min-p sampling threshold. |
| `--repetition-penalty` | float | `1` | Penalize repeated tokens. Values > 1 reduce repetition. |
| `--presence-penalty` | float | — | Penalize tokens that have appeared at all. |
| `--frequency-penalty` | float | — | Penalize tokens proportional to their frequency. |
| `--seed` | int | — | Random seed for reproducible outputs. |
| `--grammar-path` | string | — | Path to a GBNF grammar file for constrained generation. |
| `--grammar-string` | string | — | Inline grammar in GBNF string format. |
| `--enable-json` | — | — | Force JSON-only output. |

### Model flags

Control model loading, context, and generation limits.

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| `-n`, `--ngl` | int | `999` | Number of layers to offload to GPU. |
| `--nctx` | int | `4096` | Context window size (max input + output tokens). |
| `--max-tokens` | int | `2048` | Maximum tokens to generate per response. |
| `--stop` | string\[] | — | Stop sequences (can be specified multiple times). |
| `--stop-file` | string | — | File containing stop sequences (one per line). |
| `--think` / `--think=false` | bool | `true` | Enable or disable thinking mode for reasoning models. |
| `-s`, `--system-prompt` | string | — | System prompt to set model behavior. |

## **Utility commands**

| Command | Description | Example |
| --- | --- | --- |
| `geniex list` | Display all downloaded models with their names and sizes. | `geniex list` |
| `geniex remove ` | Remove a specific local model by name. | `geniex remove unsloth/Qwen3-0.6B-GGUF` |
| `geniex clean` | Delete all locally cached models. | `geniex clean` |
| `geniex infer -h` | Show help for `geniex infer`. | `geniex infer -h` |
| `geniex serve -h` | Show help for `geniex serve`. | `geniex serve -h` |
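
The commands and flags on this page can be chained in a small script. The following is a minimal sketch, not an official GenieX workflow: the `run` wrapper and the `DRY_RUN` variable are hypothetical helpers introduced here for illustration, and the flag values (`0.7`, `0.9`, `42`, `512`) are arbitrary. By default the script only prints each command, so it can be reviewed before anything is downloaded or launched.

```bash
# Hypothetical wrapper (not part of GenieX): prints the command when
# DRY_RUN is 1 (the default here) and executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Model name taken from the examples on this page.
MODEL="unsloth/Qwen3.5-0.8B-GGUF"

# 1. Download the model; --model-hub is omitted so the source is auto-detected.
run geniex pull "$MODEL"

# 2. Start a chat on the CPU with a fixed seed and a shorter reply limit.
run geniex infer "$MODEL" --compute cpu --temperature 0.7 --top-p 0.9 \
  --seed 42 --max-tokens 512

# 3. List downloaded models to confirm the pull landed in the local cache.
run geniex list
```

Setting `DRY_RUN=0` in the environment makes the same script execute the commands instead of printing them. For GGUF models, the `pull` step may still stop at the interactive precision prompt described above.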
