# CLI reference

> Every GenieX CLI command and flag, with usage examples.

## **Model inference**

### **`geniex pull`**

Download a model and store it locally.

```powershell theme={"dark"}
geniex pull <model-name>[:<precision>]
```

| Flag           | Description                                                                     |
| -------------- | ------------------------------------------------------------------------------- |
| `--model-hub`  | Model source: `aihub` \| `hf` \| `localfs`. Auto-detected when omitted.         |
| `--local-path` | Path to a local directory or AI Hub `.zip` file. Implies `--model-hub localfs`. |
| `--model-type` | Model type: `llm` \| `vlm`. Auto-detected when omitted.                         |

**Pulling from a local path:**

```powershell theme={"dark"}
geniex pull local/my-model --local-path /path/to/model-dir
```

<Note>`pull` copies files into the GenieX cache. After a successful pull you can safely delete the source to avoid keeping two copies.</Note>
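If you script that cleanup, a cautious order is to pull, confirm that the model appears in `geniex list`, and only then delete the source. The bash sketch below assumes that `geniex list` prints each model name verbatim. `pull_and_drop_source` is a hypothetical helper name, and the paths are placeholders.

```shell
# Hypothetical helper: copy a local model into the GenieX cache, then delete
# the source copy only after `geniex list` shows the model.
# Assumes `geniex list` prints each model name verbatim.
pull_and_drop_source() {
  local name=$1 src=$2 listed
  if ! geniex pull "$name" --local-path "$src"; then
    echo "pull failed; keeping $src" >&2
    return 1
  fi
  # Delete the source only once the cached copy is visible.
  listed=$(geniex list) || return 1
  if ! grep -qF -- "$name" <<< "$listed"; then
    echo "$name not listed after pull; keeping $src" >&2
    return 1
  fi
  rm -rf -- "$src"
}
```

For example: `pull_and_drop_source local/my-model /path/to/model-dir`. The name check is a plain substring match, so a longer model name that contains `name` would also satisfy it.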

**Precision (quantization), llama.cpp models only**

For GGUF models, the CLI prompts you to pick a precision:

```text theme={"dark"}
Choose a precision version to download
> Q4_0       [1.2 GiB] (default)
  Q8_0       [2.0 GiB]
  F16        [3.8 GiB]
```

<Tip>`Q4_0` has the best Hexagon NPU support. See [Precisions (Quantizations) Supported](/en/models/supported#precisions-quantizations-supported).</Tip>

Qualcomm AI Hub Models are pre-quantized, so there is no precision to choose.
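The `:<precision>` suffix in the `geniex pull` synopsis lets you name the precision up front. This is presumably useful for scripted setups that should not stop at the interactive prompt. Below is a minimal bash sketch that assumes each listed repository publishes the requested precision. `pull_all_at` is a hypothetical helper, and the model names are only examples.

```shell
# Hypothetical helper: pull several GGUF models at the same precision.
# Assumes each repository publishes the requested precision (e.g. Q4_0).
pull_all_at() {
  local precision=$1
  shift
  local model failed=0
  for model in "$@"; do
    # <model-name>:<precision>, following the `geniex pull` synopsis.
    if ! geniex pull "${model}:${precision}"; then
      echo "could not pull ${model}:${precision}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

For example: `pull_all_at Q4_0 unsloth/Qwen3-0.6B-GGUF unsloth/Qwen3.5-0.8B-GGUF`. The loop keeps going after a failure and reports a non-zero status at the end, so a single missing precision does not hide the results for the other models.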

### **`geniex infer` — LLM**

Launch an interactive chat session with a language model.

```powershell theme={"dark"}
geniex infer ai-hub-models/Qwen3-4B
```

**Thinking mode** — control whether the model shows reasoning before responding:

```powershell theme={"dark"}
geniex infer ai-hub-models/Qwen3-4B --think         # show reasoning steps
geniex infer ai-hub-models/Qwen3-4B --think=false   # respond directly
```

**Compute unit selection** (via `--compute`) — pick which compute unit runs the model (default: `npu`):

```powershell theme={"dark"}
# llama.cpp models support all compute units
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute npu
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute gpu
geniex infer unsloth/Qwen3.5-0.8B-GGUF --compute cpu

# Qualcomm AI Hub Models only support NPU
geniex infer ai-hub-models/Qwen3-4B --compute npu
```

<Warning>Qualcomm AI Hub Models run on NPU only. Using `--compute cpu` or `--compute gpu` returns an error.</Warning>

### **`geniex infer` — VLM**

Run vision-language inference with text-only or image input:

```bash theme={"dark"}
geniex infer ai-hub-models/Qwen2.5-VL-7B-Instruct-GGUF
```

For text-only input, launch the session and chat as usual. For image input, include the image's **absolute path** in your prompt, or drag the file into your terminal to paste its path:

```text theme={"dark"}
Describe this picture </full/path/to/image.png>
```
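Because the prompt needs an absolute path, a small bash helper can expand a relative path before you paste the line into the chat. This is only a sketch: `image_prompt` is a hypothetical name, and it builds the text line without communicating with GenieX in any way.

```shell
# Hypothetical helper: print a prompt line that embeds the absolute path of
# an existing image file, ready to paste into a VLM `geniex infer` session.
image_prompt() {
  local file=$1 question=${2:-Describe this picture}
  if [[ ! -f $file ]]; then
    echo "no such file: $file" >&2
    return 1
  fi
  local dir
  # Resolve the directory part in a subshell; `pwd -P` also resolves symlinks.
  dir=$(cd -- "$(dirname -- "$file")" && pwd -P) || return 1
  printf '%s %s/%s\n' "$question" "$dir" "$(basename -- "$file")"
}
```

For example, `image_prompt ./photos/cat.png` prints `Describe this picture` followed by the file's full path. In Git Bash on Windows the result looks like `/c/Users/...`. If GenieX expects a Windows-style path there, `cygpath -w` can convert it.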

## **`geniex serve`**

Start the OpenAI-compatible local server. See [Local server](/en/run/cli/local-server) for the API.

```bash theme={"dark"}
geniex serve
```
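OpenAI-compatible servers conventionally accept `POST /v1/chat/completions`. Assuming `geniex serve` follows that convention, a request from bash could look like the sketch below. The base URL is a placeholder that you replace with the server's actual address (see [Local server](/en/run/cli/local-server)). `geniex_chat` is a hypothetical helper, the `model` value is assumed to be a name as shown by `geniex list`, `curl` must be installed, and the JSON escaping covers only backslashes and double quotes.

```shell
# Hypothetical helper: send one user message to an OpenAI-compatible server.
# base_url is a placeholder; take the real address from the Local server docs.
geniex_chat() {
  local base_url=$1 model=$2 prompt=$3
  # Minimal JSON string escaping: backslashes first, then double quotes.
  prompt=${prompt//\\/\\\\}
  prompt=${prompt//\"/\\\"}
  curl -sS "${base_url%/}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}]}"
}
```

Prompts that contain newlines or other control characters need fuller JSON encoding (for example with a tool such as `jq`) before being sent this way.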

## **Configuration flags**

These flags can be passed to `geniex infer` to control model loading and generation.

### Sampler flags

Control how the model selects tokens during generation.

| Flag                   | Type   | Default | Description                                                 |
| ---------------------- | ------ | ------- | ----------------------------------------------------------- |
| `--temperature`        | float  | —       | Sampling temperature. Higher values increase randomness.    |
| `--top-p`              | float  | —       | Top-p (nucleus) sampling threshold.                         |
| `--top-k`              | int    | —       | Top-k sampling. Only consider the top-k most likely tokens. |
| `--min-p`              | float  | —       | Min-p sampling threshold.                                   |
| `--repetition-penalty` | float  | `1`     | Penalize repeated tokens. Values > 1 reduce repetition.     |
| `--presence-penalty`   | float  | —       | Penalize tokens that have appeared at all.                  |
| `--frequency-penalty`  | float  | —       | Penalize tokens proportional to their frequency.            |
| `--seed`               | int    | —       | Random seed for reproducible outputs.                       |
| `--grammar-path`       | string | —       | Path to a GBNF grammar file for constrained generation.     |
| `--grammar-string`     | string | —       | Inline grammar in GBNF string format.                       |
| `--enable-json`        | —      | —       | Force JSON-only output.                                     |
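The following bash sketch shows how these flags can be combined. It writes a one-rule GBNF grammar (the llama.cpp grammar format) that allows only `yes` or `no`, then starts a session with a fixed `--seed`. `yes_no_chat` and `yes-no.gbnf` are hypothetical names. A llama.cpp GGUF model is used because GBNF is a llama.cpp format. Turning thinking off is an assumption, based on the reasoning that a reasoning trace would not match the grammar.

```shell
# Hypothetical helper: constrain every reply to "yes" or "no" and fix the
# random seed so repeated runs are reproducible.
yes_no_chat() {
  local model=$1 grammar_file=${2:-yes-no.gbnf}
  # GBNF: the root rule matches exactly one of two literal strings.
  cat > "$grammar_file" <<'EOF'
root ::= "yes" | "no"
EOF
  # Thinking is disabled on the assumption that reasoning text would not
  # satisfy the grammar.
  geniex infer "$model" --grammar-path "$grammar_file" \
    --temperature 0.2 --seed 42 --think=false
}
```

For example: `yes_no_chat unsloth/Qwen3.5-0.8B-GGUF`. For a one-off session, the same rule can be passed inline as `--grammar-string 'root ::= "yes" | "no"'` instead of writing a file.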

### Model flags

Control model loading, context, and generation limits.

| Flag                        | Type      | Default | Description                                           |
| --------------------------- | --------- | ------- | ----------------------------------------------------- |
| `-n`, `--ngl`               | int       | `999`   | Number of layers to offload to GPU.                   |
| `--nctx`                    | int       | `4096`  | Context window size (max input + output tokens).      |
| `--max-tokens`              | int       | `2048`  | Maximum tokens to generate per response.              |
| `--stop`                    | string\[] | —       | Stop sequences (can be specified multiple times).     |
| `--stop-file`               | string    | —       | File containing stop sequences (one per line).        |
| `--think` / `--think=false` | bool      | `true`  | Enable or disable thinking mode for reasoning models. |
| `-s`, `--system-prompt`     | string    | —       | System prompt to set model behavior.                  |
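Because `--nctx` covers input and output together, every token reserved with `--max-tokens` is taken from the input's share. With the defaults, 4096 - 2048 leaves 2048 tokens of context for input. The bash sketch below checks that budget before starting a session, and it also writes a `--stop-file` with one sequence per line. `long_context_chat`, `stops.txt`, the stop strings, and the 8192-token context are all illustrative. Whether a given model accepts a larger `--nctx` depends on the model.

```shell
# Hypothetical helper: start a session with an explicit context budget.
# Assumes the whole conversation (system prompt + history) counts as input.
long_context_chat() {
  local model=$1 nctx=${2:-8192} max_tokens=${3:-1024}
  local prompt_room=$(( nctx - max_tokens ))
  if (( prompt_room <= 0 )); then
    echo "--max-tokens ($max_tokens) must be below --nctx ($nctx)" >&2
    return 1
  fi
  echo "input budget: $prompt_room tokens" >&2
  # --stop-file expects one stop sequence per line.
  printf '%s\n' '###' 'User:' > stops.txt
  geniex infer "$model" --nctx "$nctx" --max-tokens "$max_tokens" \
    --stop-file stops.txt -s "You are a concise assistant."
}
```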

## **Utility commands**

| Command                 | Description                                               | Example                                 |
| ----------------------- | --------------------------------------------------------- | --------------------------------------- |
| `geniex list`           | Display all downloaded models with their names and sizes. | `geniex list`                           |
| `geniex remove <model>` | Remove a specific local model by name.                    | `geniex remove unsloth/Qwen3-0.6B-GGUF` |
| `geniex clean`          | Delete all locally cached models.                         | `geniex clean`                          |
| `geniex infer -h`       | Show help for `geniex infer`.                             | `geniex infer -h`                       |
| `geniex serve -h`       | Show help for `geniex serve`.                             | `geniex serve -h`                       |
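To remove several models at once without wiping the whole cache the way `geniex clean` does, you can loop over `geniex remove` in bash. The sketch skips names that `geniex list` does not mention, and it assumes that the list output contains each name verbatim. `remove_models` is a hypothetical helper, and the check is a plain substring match.

```shell
# Hypothetical helper: remove several cached models by name, skipping any
# that `geniex list` does not mention.
remove_models() {
  local listed model
  listed=$(geniex list) || return 1
  for model in "$@"; do
    if grep -qF -- "$model" <<< "$listed"; then
      geniex remove "$model"
    else
      echo "not cached, skipping: $model" >&2
    fi
  done
}
```

For example: `remove_models unsloth/Qwen3-0.6B-GGUF local/my-model`.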

