This page walks through running your first model from a Kotlin app, then swapping in a different model. For a complete reference app with chat UI, model picker, and VLM support, see the sample app.

Prerequisites

  • The SDK added to your Gradle project — see Install.
  • A phone with a Snapdragon 8 Elite or Snapdragon 8 Elite Gen 5 chipset.
  • INTERNET permission in your AndroidManifest.xml (the SDK pulls weights from Hugging Face / Qualcomm AI Hub on first use).

Run your first model

The flow is the same regardless of model: init the SDK → pull weights → load → generate. Below is a minimal end-to-end example using unsloth/Qwen3-0.6B-GGUF — a small Qwen3 0.6B chat model that runs on any supported chipset.
1. Init the SDK

Call once on app startup (idempotent — safe inside Activity.onCreate):
GenieXSdk.getInstance().init(context)
2. Pull the model

pullFlow streams progress events. Run inside a coroutine on Dispatchers.IO:
ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "unsloth/Qwen3-0.6B-GGUF",
        precision  = "Q4_0",
        hub        = HubSource.HUGGINGFACE,
    )
).collect { event ->
    when (event) {
        is ModelManagerWrapper.PullEvent.Progress  -> { /* update UI */ }
        ModelManagerWrapper.PullEvent.Completed    -> { /* done */ }
        is ModelManagerWrapper.PullEvent.Error     -> { /* show error */ }
    }
}
Downloads are resumable — killing the app mid-pull and re-running picks up where it left off.
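As a rough sketch of what the "update UI" branch might compute, the helper below folds per-file byte counts into a single percentage. FileProgress and its field names are hypothetical stand-ins introduced for illustration; the real fields of PullEvent.Progress are not shown on this page, so check the API reference before adapting it.

```kotlin
// Hypothetical stand-in for the per-file byte counts carried by a Progress event.
data class FileProgress(val downloadedBytes: Long, val totalBytes: Long)

// Returns overall progress in 0..100, or null while no total size is known.
fun overallPercent(files: List<FileProgress>): Int? {
    val total = files.sumOf { it.totalBytes.coerceAtLeast(0L) }
    if (total == 0L) return null
    // Clamp each file so a stale or oversized count cannot push the bar past 100%.
    val done = files.sumOf {
        it.downloadedBytes.coerceIn(0L, it.totalBytes.coerceAtLeast(0L))
    }
    return ((done * 100) / total).toInt()
}
```

Returning null for an unknown total lets the UI show an indeterminate indicator instead of a misleading 0%.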
3. Load the model

Resolve the on-disk paths and build an LlmWrapper:
val paths = ModelManagerWrapper.getPaths("unsloth/Qwen3-0.6B-GGUF")
    ?: error("Model not downloaded")

val llm = LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name   = paths.model_name,
            model_path   = paths.model_path,
            config       = ModelConfig(nCtx = 4096),
            runtime_id   = "llama_cpp",
            compute_unit = null,   // null → NPU on Snapdragon (recommended)
        )
    )
    .build()
    .getOrThrow()
4. Generate

Apply the chat template, then collect tokens from the streaming flow:
val chat = arrayListOf(ChatMessage("user", "What is AI?"))
val templated = llm.applyChatTemplate(chat.toTypedArray(), null, false).getOrThrow()

llm.generateStreamFlow(
    templated.formattedText,
    GenerationConfig(maxTokens = 2048),
).collect { result ->
    when (result) {
        is LlmStreamResult.Token     -> print(result.text)
        is LlmStreamResult.Completed -> println("\nDone")
        is LlmStreamResult.Error     -> println("Error: ${result.throwable}")
    }
}
Always pass templated.formattedText (the chat-templated prompt) into generateStreamFlow, not the raw user text. The native pipeline expects an already-templated prompt.
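The example above prints tokens as they arrive; a chat UI usually also needs the full reply once the stream ends. The sketch below shows one way to accumulate it. StreamEvent is a hypothetical mirror of the three result cases used above, written only so the snippet stands alone; in an app you would match on the SDK's own LlmStreamResult inside collect.

```kotlin
// Hypothetical mirror of the three stream results handled above.
sealed interface StreamEvent {
    data class Token(val text: String) : StreamEvent
    object Completed : StreamEvent
    data class Error(val throwable: Throwable) : StreamEvent
}

// Folds streamed events into the full reply text.
// Stops at the first Completed event and fails on the first Error event.
fun collectReply(events: Iterable<StreamEvent>): Result<String> {
    val reply = StringBuilder()
    for (event in events) {
        when (event) {
            is StreamEvent.Token -> reply.append(event.text)
            is StreamEvent.Completed -> return Result.success(reply.toString())
            is StreamEvent.Error -> return Result.failure(event.throwable)
        }
    }
    // The stream ended without Completed; return whatever was received.
    return Result.success(reply.toString())
}
```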

Switching models

Swapping models is mostly a matter of changing the model_name and the runtime_id. There are two runtimes:
  • llama_cpp — runs any GGUF model. Supports NPU / GPU / CPU compute units via compute_unit.
  • qairt (Qualcomm AI Engine Direct) — runs Qualcomm AI Hub Models. NPU-only, requires an explicit chipset on Android.

Another GGUF model (llama.cpp)

Just change the model_name (and precision if you want a different one) — the rest of the flow is identical:
ModelPullInput(
    model_name = "unsloth/Qwen3-VL-2B-Instruct-GGUF",
    precision  = "Q4_0",
    hub        = HubSource.HUGGINGFACE,
)
For VLMs, also pass paths.mmproj_path into VlmCreateInput — see API reference → VLM.

A Qualcomm AI Hub Model (NPU via Qualcomm AI Engine Direct)

Qualcomm AI Hub Models are pre-compiled per chipset and only run on the NPU. You must pass chipset on Android:
ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "ai-hub-models/Qwen3-4B-Instruct-2507",
        hub        = HubSource.AUTO,   // routes ai-hub-models/* to Qualcomm AI Hub
        chipset    = "SM8750",         // SM8750 = 8 Elite, SM8850 = 8 Elite Gen 5
    )
).collect { /* … */ }
Then set runtime_id = "qairt" in LlmCreateInput. See the supported Qualcomm AI Hub repos in the API reference.
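If your app offers both kinds of model, the runtime and chipset choice can be derived from the model name. The sketch below is one possible heuristic, not part of the SDK: runtimeIdFor, chipsetFor, and chipsetCodes are hypothetical helpers, and treating every repo outside ai-hub-models/ as a GGUF repo is an assumption of this example.

```kotlin
// Marketing name -> chipset code, taken from the comment in the pull snippet above.
val chipsetCodes = mapOf(
    "Snapdragon 8 Elite" to "SM8750",
    "Snapdragon 8 Elite Gen 5" to "SM8850",
)

// Assumption of this sketch: repos under ai-hub-models/ use qairt,
// everything else is treated as a GGUF repo for llama_cpp.
fun runtimeIdFor(modelName: String): String =
    if (modelName.startsWith("ai-hub-models/")) "qairt" else "llama_cpp"

// qairt pulls need an explicit chipset; llama_cpp pulls do not (null).
fun chipsetFor(modelName: String, socName: String): String? {
    return if (runtimeIdFor(modelName) == "qairt") {
        chipsetCodes[socName] ?: error("No chipset code known for $socName")
    } else {
        null
    }
}
```

The results would then feed runtime_id in LlmCreateInput and chipset in ModelPullInput.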

Switching compute unit (NPU / GPU / CPU)

For llama_cpp only — set compute_unit on LlmCreateInput:
compute_unit     Compute unit
null or "npu"    Hexagon NPU (recommended on Snapdragon).
"gpu"            Adreno GPU via OpenCL.
"cpu"            Pure CPU. Works on any ARM64 chipset.
Qualcomm AI Engine Direct ignores this — cpu/gpu are coerced to NPU with a warning.
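A compute-unit picker can apply the same rules before loading, so the UI shows the unit that will actually be used. The function below is a minimal sketch of the rules described above; effectiveComputeUnit is a hypothetical helper, and the SDK performs its own coercion regardless.

```kotlin
// Resolves the compute unit a model would run on, following the rules above:
// null means NPU, and qairt always ends up on the NPU.
fun effectiveComputeUnit(runtimeId: String, requested: String?): String {
    val unit = requested ?: "npu"
    require(unit in setOf("npu", "gpu", "cpu")) { "Unknown compute unit: $requested" }
    return if (runtimeId == "qairt") "npu" else unit
}
```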

Using a local model

If the weights are already on the device (side-loaded via adb push, bundled in your app’s files dir, or produced by another tool), point the model manager at that directory instead of a hub: set hub = HubSource.LOCALFS and local_path to the on-disk location. pullFlow then imports the model into the SDK cache without touching the network, after which getPaths and LlmWrapper work exactly as they do for a downloaded model. The full Android snippets for importing a local GGUF model and a local Qualcomm AI Engine Direct bundle are on the Models page.

Using the sample app

The sample app is a fully wired chat client built on top of the snippets above. A few patterns worth borrowing when you build your own UI:
  • Model picker UI — the dropdown is driven by app/src/main/assets/model_list.json. Each entry pins a model_name, hub, and (for Qualcomm AI Engine Direct) a chipset. Edit this file to add new models without touching code.
  • Resumable downloads with progress — the Progress events from pullFlow carry per-file byte counts; the sample wires them straight into a LinearProgressIndicator.
  • Runtime-aware compute-unit picker — when the selected model uses Qualcomm AI Engine Direct, the picker hides GPU/CPU options. See LoadDialog.kt.
  • VLM image picker — for VLMs, the sample passes the absolute file path into VlmContent("image", path). Don’t pass content URIs — the native side reads the file directly.
Clone qualcomm/ai-hub-apps, open it in Android Studio, and hit Run ▶.
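For the VLM image picker pattern, an image that arrives as a content URI first has to be copied to a real file so that an absolute path can be handed to the native side. The helper below is a sketch of that copy step using only java.io and the Kotlin standard library; copyToLocalFile is a hypothetical name. On Android the stream would typically come from ContentResolver.openInputStream(uri) and the target directory could be context.cacheDir; how the sample app does this is not shown here.

```kotlin
import java.io.File
import java.io.InputStream

// Copies the stream into targetDir and returns the absolute path of the new file.
// The input stream is closed when the copy finishes or fails.
fun copyToLocalFile(input: InputStream, targetDir: File, fileName: String): String {
    targetDir.mkdirs()
    val target = File(targetDir, fileName)
    input.use { source ->
        target.outputStream().use { sink -> source.copyTo(sink) }
    }
    return target.absolutePath
}
```

The returned path is what would be passed as the second argument of VlmContent("image", path).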

Next steps

API reference

Wrapper classes, runtime / compute-unit selection, and data structures.

Platforms & runtimes

Snapdragon platforms and when to pick llama.cpp vs Qualcomm AI Engine Direct.