Prerequisites
- The SDK added to your Gradle project — see Install.
- A phone running Snapdragon 8 Elite or Snapdragon 8 Elite Gen 5.
- INTERNET permission in your AndroidManifest.xml (the SDK pulls weights from Hugging Face / Qualcomm AI Hub on first use).
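For reference, the network permission in the last prerequisite is a standard Android manifest entry, not something specific to this SDK. A minimal fragment might look like the following; the surrounding application element stands in for whatever your project already declares.

```xml
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
    <!-- Lets the SDK download model weights on first use -->
    <uses-permission android:name="android.permission.INTERNET" />

    <application>
        <!-- your existing activities and services go here -->
    </application>
</manifest>
```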
Run your first model
The flow is the same regardless of model: init the SDK → pull weights → load → generate. Below is a minimal end-to-end example using unsloth/Qwen3-0.6B-GGUF, a small Qwen3 0.6B chat model that runs on any supported chipset.
Pull the model
pullFlow streams progress events. Run it inside a coroutine on Dispatchers.IO.
Switching models
Swapping models is mostly a matter of changing the model_name and the runtime_id. There are two runtimes:
- llama_cpp: runs any GGUF model. Supports NPU / GPU / CPU compute units via compute_unit.
- qairt (Qualcomm AI Engine Direct): runs Qualcomm AI Hub Models. NPU-only, requires an explicit chipset on Android.
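The constraints on the two runtimes can be checked before any SDK call is made. The sketch below is illustrative only: RuntimeChoice and validate are hypothetical names invented for this example and are not part of the SDK, and the chipset string is a placeholder. It simply encodes the two rules stated above (llama_cpp accepts NPU / GPU / CPU; qairt is NPU-only and needs a chipset on Android).

```kotlin
// Hypothetical model of a runtime selection. These names are illustrative
// and are NOT the SDK's own types.
data class RuntimeChoice(
    val runtimeId: String,            // "llama_cpp" or "qairt"
    val computeUnit: String? = null,  // "npu", "gpu", "cpu", or null
    val chipset: String? = null,      // needed for "qairt" on Android
)

// Returns a list of human-readable problems; an empty list means the
// choice is consistent with the rules described in the text.
fun validate(choice: RuntimeChoice): List<String> {
    val problems = mutableListOf<String>()
    when (choice.runtimeId) {
        "llama_cpp" -> {
            if (choice.computeUnit !in setOf(null, "npu", "gpu", "cpu")) {
                problems += "unknown compute_unit: ${choice.computeUnit}"
            }
        }
        "qairt" -> {
            if (choice.chipset.isNullOrBlank()) {
                problems += "qairt requires an explicit chipset on Android"
            }
            if (choice.computeUnit != null && choice.computeUnit != "npu") {
                problems += "qairt is NPU-only"
            }
        }
        else -> problems += "unknown runtime_id: ${choice.runtimeId}"
    }
    return problems
}

fun main() {
    // "example-chipset" is a placeholder, not a real chipset identifier.
    println(validate(RuntimeChoice("llama_cpp", computeUnit = "gpu")))
    println(validate(RuntimeChoice("qairt", chipset = "example-chipset")))
    println(validate(RuntimeChoice("qairt", computeUnit = "cpu")))
}
```

Running a check like this in the UI layer lets you reject an inconsistent selection early, rather than finding out when the model fails to load.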
Another GGUF model (llama.cpp)
Just change the model_name (and precision, if you want a different one); the rest of the flow is identical.
For a VLM, also pass paths.mmproj_path into VlmCreateInput; see API reference → VLM.
A Qualcomm AI Hub Model (NPU via Qualcomm AI Engine Direct)
Qualcomm AI Hub Models are pre-compiled per chipset and only run on the NPU. You must pass chipset on Android.
Set runtime_id = "qairt" in LlmCreateInput. See the supported Qualcomm AI Hub repos in the API reference.
Switching compute unit (NPU / GPU / CPU)
For llama_cpp only, set compute_unit on LlmCreateInput:
| compute_unit | Compute unit |
|---|---|
| null or "npu" | Hexagon NPU (recommended on Snapdragon). |
| "gpu" | Adreno GPU via OpenCL. |
| "cpu" | Pure CPU. Works on any ARM64 chipset. |
With qairt, cpu / gpu are coerced to NPU with a warning.
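The mapping in the table, together with the coercion note, can be written as a small pure function. This is a sketch under stated assumptions: ComputeUnit and resolveComputeUnit are hypothetical names, not SDK APIs, and the coercion is assumed to apply to the qairt runtime because that is the runtime described as NPU-only. The SDK's actual behavior and warning text may differ.

```kotlin
// Hypothetical enum for illustration; not an SDK type.
enum class ComputeUnit { NPU, GPU, CPU }

// Maps a compute_unit string to a compute unit, following the table:
// null or "npu" -> NPU, "gpu" -> GPU, "cpu" -> CPU.
// Assumption: on "qairt", GPU/CPU requests fall back to NPU with a warning.
fun resolveComputeUnit(
    runtimeId: String,
    requested: String?,
    warn: (String) -> Unit = { println(it) },
): ComputeUnit {
    val unit = when (requested) {
        null, "npu" -> ComputeUnit.NPU
        "gpu" -> ComputeUnit.GPU
        "cpu" -> ComputeUnit.CPU
        else -> throw IllegalArgumentException("unknown compute_unit: $requested")
    }
    if (runtimeId == "qairt" && unit != ComputeUnit.NPU) {
        warn("compute_unit=$requested is not available on qairt; using NPU")
        return ComputeUnit.NPU
    }
    return unit
}

fun main() {
    println(resolveComputeUnit("llama_cpp", null))
    println(resolveComputeUnit("llama_cpp", "gpu"))
    println(resolveComputeUnit("qairt", "cpu"))
}
```

Passing the warning sink as a parameter keeps the function easy to test and lets an app route the message to its own logger.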
Using a local model
If the weights are already on the device (side-loaded via adb push, bundled in your app's files dir, or produced by another tool), point the model manager at that directory instead of a hub: set hub = HubSource.LOCALFS and local_path to the on-disk location. pullFlow imports it into the SDK cache (no network), after which getPaths / LlmWrapper work exactly as they do for a downloaded model.
The full Android snippets for importing a local GGUF model and a local Qualcomm AI Engine Direct bundle live on the Models page.
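Before handing a directory to the model manager as local_path, it can help to confirm that it exists and actually contains weights. The helper below is a hypothetical pre-flight check for the GGUF case only; findGgufFiles is not an SDK function, and the SDK may apply its own, different validation during import.

```kotlin
import java.io.File

// Hypothetical pre-flight helper (not part of the SDK): lists the GGUF
// files directly inside a candidate local_path directory, sorted by name.
fun findGgufFiles(localPath: String): List<File> {
    val dir = File(localPath)
    require(dir.isDirectory) { "local_path is not a directory: $localPath" }
    return dir.listFiles().orEmpty()
        .filter { it.isFile && it.extension.equals("gguf", ignoreCase = true) }
        .sortedBy { it.name }
}

fun main() {
    // Example path only; use wherever your weights were pushed or bundled.
    val candidates = runCatching { findGgufFiles("/data/local/tmp/models") }
        .getOrDefault(emptyList())
    if (candidates.isEmpty()) {
        println("No GGUF files found; check the path before importing.")
    } else {
        candidates.forEach { println("Found ${it.name} (${it.length()} bytes)") }
    }
}
```

An empty result is a cheap signal that the adb push target or the bundled asset path is wrong, which is easier to diagnose here than from a failed import.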
Using the sample app
The sample app is a fully wired chat client built on top of the snippets above. A few patterns worth borrowing when you build your own UI:
- Model picker UI: the dropdown is driven by app/src/main/assets/model_list.json. Each entry pins a model_name, hub, and (for Qualcomm AI Engine Direct) a chipset. Edit this file to add new models without touching code.
- Resumable downloads with progress: the Progress events from pullFlow carry per-file byte counts; the sample wires them straight into a LinearProgressIndicator.
- Runtime-aware compute-unit picker: when the selected model uses Qualcomm AI Engine Direct, the picker hides GPU/CPU options. See LoadDialog.kt.
- VLM image picker: for VLMs, the sample passes the absolute file path into VlmContent("image", path). Don't pass content URIs; the native side reads the file directly.
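As an illustration of the download-progress pattern in the list above, per-file byte counts can be folded into a single fraction between 0 and 1, the kind of value a determinate progress indicator typically expects. The FileProgress class and its field names are assumptions made for this sketch; the SDK's real Progress event may be shaped differently.

```kotlin
// Hypothetical shape of a per-file progress record; the SDK's real
// Progress event fields may differ.
data class FileProgress(
    val fileName: String,
    val bytesDownloaded: Long,
    val bytesTotal: Long,
)

// Overall completion in 0.0..1.0, weighted by file size. Files whose total
// size is unknown (<= 0) are ignored rather than guessed at.
fun overallFraction(files: List<FileProgress>): Float {
    val known = files.filter { it.bytesTotal > 0L }
    val total = known.sumOf { it.bytesTotal }
    if (total == 0L) return 0f
    val done = known.sumOf { it.bytesDownloaded.coerceIn(0L, it.bytesTotal) }
    return (done.toDouble() / total.toDouble()).toFloat()
}

fun main() {
    val snapshot = listOf(
        FileProgress("weights.gguf", bytesDownloaded = 50, bytesTotal = 300),
        FileProgress("tokenizer.json", bytesDownloaded = 100, bytesTotal = 100),
    )
    println(overallFraction(snapshot)) // 150 of 400 bytes -> 0.375
}
```

Weighting by bytes rather than by file count keeps the bar from jumping to 50% as soon as a small file finishes while the large weights file has barely started.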
To try it, clone qualcomm/ai-hub-apps, open it in Android Studio, and hit Run ▶.
Next steps
API reference
Wrapper classes, runtime / compute-unit selection, and data structures.
Platforms & runtimes
Snapdragon platforms and when to pick llama.cpp vs Qualcomm AI Engine Direct.