# Local server

> Run an OpenAI-compatible HTTP API on localhost backed by Snapdragon NPU/GPU/CPU acceleration.

GenieX includes a built-in inference server that exposes an **OpenAI-compatible API**. Run models on-device and connect them to any application or framework that speaks the OpenAI protocol — agentic frameworks like **LangChain**, AI-native apps like **OpenClaw**, or your own code. No cloud dependency.

## **Prerequisites**

* The CLI installed. See [Install](/en/run/cli/install).
* An interactive shell inside the container (Docker only). See [Run interactively](/en/run/linux/install#run-interactively).
* A model already pulled. `geniex serve` does **not** download models automatically.

## **Start the server**

Pull a model:

```bash bash theme={"dark"}
geniex pull ai-hub-models/Qwen3-4B-Instruct-2507
```

Start the server:

```bash bash theme={"dark"}
geniex serve
```

The server runs on `http://127.0.0.1:18181` by default. Keep this terminal open and make requests from another one. Run `geniex serve -h` for all configurable options.
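Before sending requests from the second terminal, it can help to confirm that the server is accepting connections. The following is a minimal sketch that polls the default address with a plain TCP connect. The helper name `wait_for_server` and its timeout values are illustrative and not part of GenieX.

```python
import socket
import time


def wait_for_server(host: str = "127.0.0.1", port: int = 18181, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass.

    This only checks that something is listening on the port; it does not
    verify that a model is loaded or that the HTTP API is healthy.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError (e.g. connection refused) on failure
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # brief pause before the next attempt
    return False
```

Calling `wait_for_server()` with no arguments targets the default `127.0.0.1:18181`. If you changed the host or port when starting `geniex serve`, pass the matching values.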

## **POST /v1/chat/completions**

Creates a model response for a conversation. Supports both LLM (text-only) and VLM (image + text) requests.

### **LLM request**

```json Example Value theme={"dark"}
{
  "model": "ai-hub-models/Qwen3-4B-Instruct-2507",
  "messages": [
    {"role": "user", "content": "Hello! Briefly introduce yourself."}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "stream": false
}
```
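The same request body can be sent from code without any third-party packages. The sketch below uses only the Python standard library. It assumes the server is running at the default address and that the response body follows the OpenAI chat completion schema (`choices[0].message.content`). The helper names `build_chat_request`, `extract_reply`, and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18181"  # default address stated on this page


def build_chat_request(payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Return the assistant text from a non-streaming response.

    Assumption: the body follows the OpenAI chat completion schema,
    so the generated text is at choices[0].message.content.
    """
    return response_body["choices"][0]["message"]["content"]


def chat(payload: dict, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """Send the payload to a running server and return the reply text."""
    request = build_chat_request(payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))


# Same body as the example above, written as a Python dict.
payload = {
    "model": "ai-hub-models/Qwen3-4B-Instruct-2507",
    "messages": [{"role": "user", "content": "Hello! Briefly introduce yourself."}],
    "max_tokens": 256,
    "temperature": 0.7,
    "stream": False,
}
```

With the server running, `print(chat(payload))` should print the model's reply. The 120-second timeout is an arbitrary allowance for on-device generation, not a documented limit.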

### **Try it from Swagger UI**

Open `http://127.0.0.1:18181` in your browser to access the built-in Swagger UI.

**Step 1.** Expand the `POST /v1/chat/completions` endpoint to view the example request body and schema.

<img src="https://mintcdn.com/qualcomm-0801e48b/VijZ6eXFSGIGNc9Z/Images/server/curl-1.png?fit=max&auto=format&n=VijZ6eXFSGIGNc9Z&q=85&s=02340eb8c5b27abce2cfd1bc5a5541b0" alt="Swagger UI showing the chat completions endpoint with example request body" width="1376" height="1059" data-path="Images/server/curl-1.png" />

**Step 2.** Click **Try it out**, edit the request body as needed, then click **Execute**.

<img src="https://mintcdn.com/qualcomm-0801e48b/VijZ6eXFSGIGNc9Z/Images/server/curl-2.png?fit=max&auto=format&n=VijZ6eXFSGIGNc9Z&q=85&s=588ba0820785188699e9b2885cbc7ece" alt="Editing the request body in Try-it-out mode before executing" width="1387" height="1447" data-path="Images/server/curl-2.png" />

**Step 3.** View the response — a `200` status with the model's generated reply.

<img src="https://mintcdn.com/qualcomm-0801e48b/VijZ6eXFSGIGNc9Z/Images/server/curl-3.png?fit=max&auto=format&n=VijZ6eXFSGIGNc9Z&q=85&s=e76944247469636be79fe528bee9ec1e" alt="Response body showing a successful 200 response with the model's reply" width="1353" height="737" data-path="Images/server/curl-3.png" />

### **VLM request**

`image_url.url` accepts three formats:

| Format                                             | Example                                                         |
| -------------------------------------------------- | --------------------------------------------------------------- |
| Local file path (the `file://` prefix is optional) | `C:/Users/Username/Pictures/photo.jpg`, `file:///tmp/photo.jpg` |
| HTTP / HTTPS URL — fetched by the server           | `https://example.com/image.jpg`                                 |
| Base64 data URL — inline image bytes               | `data:image/png;base64,iVBORw0KGgo...`                          |

<Note>
  **Running in Docker?** Local paths are resolved **inside the container**, not on your host. The install command already mounts `$PWD/data` to `/data` — drop your images there and pass `/data/cat.jpg`. Alternatively, use an HTTP URL or base64 data URL to skip the filesystem entirely.
</Note>

```json Example Value theme={"dark"}
{
  "model": "ai-hub-models/Qwen2.5-VL-7B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image succinctly."},
        {"type": "image_url", "image_url": {"url": "</path/to/image>"}}
      ]
    }
  ]
}
```

In Swagger UI, replace the request body with this VLM payload, point `image_url.url` to a local image, then click **Execute**.

<img src="https://mintcdn.com/qualcomm-0801e48b/VijZ6eXFSGIGNc9Z/Images/server/curl-4.png?fit=max&auto=format&n=VijZ6eXFSGIGNc9Z&q=85&s=c41c6a3a155739709d96ff355e832ee9" alt="Editing the VLM request body in Try-it-out mode before executing" width="1871" height="968" data-path="Images/server/curl-4.png" />

<img src="https://mintcdn.com/qualcomm-0801e48b/VijZ6eXFSGIGNc9Z/Images/server/curl-5.png?fit=max&auto=format&n=VijZ6eXFSGIGNc9Z&q=85&s=e01e48ddc6441fd8bbe23bb656482e22" alt="Response body showing a successful 200 response with the VLM's image description" width="1844" height="634" data-path="Images/server/curl-5.png" />
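The base64 data URL format from the table above avoids filesystem and container path issues, because the image bytes travel inside the request. The sketch below builds such a URL from a local file and wraps it in a VLM request body. The helper names `to_data_url` and `build_vlm_payload` are illustrative, and the MIME type is guessed from the file extension rather than from the file contents.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: str) -> str:
    """Encode a local image file as a `data:<mime>;base64,<bytes>` URL."""
    path = Path(image_path)
    mime, _ = mimetypes.guess_type(path.name)  # based on the extension only
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path.name!r}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vlm_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build a request body in the same shape as the VLM example above."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
```

For example, `build_vlm_payload("ai-hub-models/Qwen2.5-VL-7B-Instruct", "Describe this image succinctly.", to_data_url("cat.jpg"))` produces a body that can be posted to `/v1/chat/completions`. Base64 increases the payload size by roughly one third, so very large images may be better served by a file path or an HTTP URL.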

## **Python client (OpenAI SDK)**

Because the server speaks the OpenAI protocol, you can point the official `openai` Python client at the local endpoint and reuse any existing OpenAI code. Install with `pip install openai`, then:

```python python theme={"dark"}
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:18181/v1",
    api_key="geniex",  # any non-empty string; the server does not check it
)

stream = client.chat.completions.create(
    model="unsloth/Qwen3-4B-GGUF:Q4_0",  # org/repo[:precision] — Q4_0 has the best Hexagon NPU support
    messages=[
        {"role": "user", "content": "Hello! Briefly introduce yourself."},
    ],
    max_tokens=256,
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```

<Note>
  Replace the `model` value with a model you have already pulled. The optional `:<precision>` suffix selects a precision (quantization) variant (e.g. `Q4_0`, `Q4_K_M`, `Q8_0`) — `Q4_0` is recommended for llama.cpp on Hexagon NPU. See [Precisions (Quantizations) Supported](/en/models/supported#precisions-quantizations-supported).
</Note>
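If your application accepts model names from users or config files, splitting the `org/repo[:precision]` form on the client side can catch typos before a request is sent. This is a hypothetical helper based only on the format described above. The server's own parsing rules are not documented here and may differ.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    repo: str                  # the "org/repo" part
    precision: Optional[str]   # e.g. "Q4_0", or None when no suffix is given


def parse_model_ref(value: str) -> ModelRef:
    """Split "org/repo[:precision]" into its parts (client-side check only)."""
    repo, sep, precision = value.partition(":")
    org_and_name = repo.split("/")
    if len(org_and_name) != 2 or not all(org_and_name):
        raise ValueError(f"expected 'org/repo', got {repo!r}")
    if sep and not precision:
        raise ValueError("precision suffix is empty after ':'")
    return ModelRef(repo, precision or None)
```

Here `parse_model_ref("unsloth/Qwen3-4B-GGUF:Q4_0")` yields the repo `unsloth/Qwen3-4B-GGUF` with precision `Q4_0`, while a name without a suffix yields a precision of `None`.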

## **Other endpoints**

* `GET /v1/models` — list available models.
* `GET /v1/models/{model}` — get info about a specific model.

