
Runtime & compute unit selection

Runtime

Choose the inference runtime via runtime_id:
val runtime_id: String?   // "llama_cpp" | "qairt" | null
runtime_id | Runtime | Model format | Compute units
"llama_cpp" | llama.cpp + GGML Hexagon backend | GGUF | CPU / Adreno GPU / Hexagon NPU
"qairt" | Qualcomm® AI Engine Direct | Qualcomm AI Hub pre-compiled bins | Hexagon NPU only
null | SDK picks based on model paths
Constants are exposed as RuntimeIdValue (LLAMA_CPP, QAIRT).

Compute unit

compute_unit takes a friendly compute-unit alias, which is forwarded to geniex_resolve_device in the native SDK.
val compute_unit: String?   // "cpu" | "gpu" | "npu" | null
Alias | Effect
null | Runtime default: npu for llama_cpp, npu for qairt.
"npu" | Hexagon NPU acceleration. Recommended on Snapdragon.
"gpu" | Adreno GPU via OpenCL (llama_cpp only).
"cpu" | Pure CPU. Forces nGpuLayers = 0.
Qualcomm AI Engine Direct supports only the NPU. Passing "cpu" or "gpu" with a Qualcomm AI Hub model logs a warning and falls back to NPU instead of raising an error.
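The alias rules above can be restated as a small pure-Kotlin function. This is only a sketch of the documented behavior, not the SDK's implementation: resolveComputeUnit is a hypothetical helper, and how the SDK treats a null runtime_id here is an assumption (the sketch passes the alias through unchanged).

```kotlin
// Sketch only: mirrors the documented alias rules in plain Kotlin.
// resolveComputeUnit is a hypothetical helper, not part of the SDK.
fun resolveComputeUnit(runtimeId: String?, computeUnit: String?): String {
    // null means "runtime default", which the table lists as npu for both runtimes.
    val requested = computeUnit ?: "npu"
    require(requested in setOf("cpu", "gpu", "npu")) { "Unknown compute unit: $requested" }

    // Qualcomm AI Engine Direct is NPU-only: other aliases warn and fall back to NPU.
    if (runtimeId == "qairt" && requested != "npu") {
        System.err.println("warning: qairt supports NPU only; ignoring \"$requested\"")
        return "npu"
    }
    // Assumption: for llama_cpp (or a null runtime_id) the alias is used as given.
    return requested
}
```

For example, resolveComputeUnit("qairt", "gpu") yields "npu", while resolveComputeUnit("llama_cpp", "gpu") yields "gpu". A helper like this can be handy for showing the effective compute unit in a settings screen before the model is created.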

Model manager

Models are pulled on-device through the bundled Rust model manager. Do not manually adb push weights — use ModelManagerWrapper.

ModelManagerWrapper

// Init (idempotent — safe to call on every Activity.onCreate).
GenieXSdk.getInstance().init(context)

// Pull with streaming progress.
ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "unsloth/Qwen3-0.6B-GGUF",
        precision  = "Q4_0",
        hub        = HubSource.HUGGINGFACE,
    )
).collect { event ->
    when (event) {
        is ModelManagerWrapper.PullEvent.Progress  -> /* update UI */
        ModelManagerWrapper.PullEvent.Completed    -> /* done */
        is ModelManagerWrapper.PullEvent.Error     -> /* show error */
    }
}

// Resolve on-disk paths for a previously-pulled model.
val paths: ModelPaths? = ModelManagerWrapper.getPaths("unsloth/Qwen3-0.6B-GGUF")

// Inventory / cleanup.
ModelManagerWrapper.list()              // List<String>
ModelManagerWrapper.remove("org/repo") // 0 = success
ModelManagerWrapper.clean()            // wipe all cached models

ModelPullInput

data class ModelPullInput(
    val model_name:   String,                     // "org/repo" or alias
    val precision:    String? = null,             // precision (quantization) e.g. "Q4_0", "Q4_K_M"
    val hub:          HubSource = HubSource.AUTO, // AUTO routes by model_name
    val local_path:   String? = null,             // only when hub == LOCALFS
    val hf_token:     String? = null,             // falls back to GENIEX_HFTOKEN env
    val chipset:      String? = null,             // required for Qualcomm AI Hub on Android (e.g. "SM8750")
    val display_name: String? = null,
)

HubSource

enum class HubSource(val value: Int) {
    AUTO(0),          // routes by prefix (e.g. ai-hub-models/* → AIHUB)
    HUGGINGFACE(1),
    MODELSCOPE(2),
    AIHUB(3),
    VOLCES(4),
    LOCALFS(127),
}

ModelPaths

Returned by getPaths(). Feed fields directly into LlmCreateInput / VlmCreateInput:
data class ModelPaths(
    val model_path:     String,
    val model_dir:      String,
    val model_name:     String,
    val runtime_id:     String,            // authoritative; prefer over UI selection
    val mmproj_path:    String? = null,    // VLM projection weights
    val tokenizer_path: String? = null,
    val compute_unit:   String? = null,
)
Qualcomm AI Hub pulls on Android require an explicit chipset, because the Rust model manager auto-detects the chipset only on Windows on Snapdragon. Use "SM8750" for Snapdragon 8 Elite or "SM8850" for Snapdragon 8 Elite Gen 5.
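Because a missing chipset only surfaces once the pull is attempted, it may be worth checking the input up front. The sketch below is an illustration under stated assumptions: Hub is a local stand-in for the SDK's HubSource enum, and targetsAiHub and chipsetError are hypothetical helpers. The only AUTO routing rule it relies on is the documented ai-hub-models/* prefix.

```kotlin
// Local stand-in for the SDK's HubSource enum so the sketch compiles on its own.
enum class Hub { AUTO, HUGGINGFACE, MODELSCOPE, AIHUB, VOLCES, LOCALFS }

// Hypothetical helper: true when the pull is expected to be served by Qualcomm AI Hub.
// Uses only the documented rule that AUTO routes "ai-hub-models/*" to AIHUB.
fun targetsAiHub(modelName: String, hub: Hub): Boolean =
    hub == Hub.AIHUB || (hub == Hub.AUTO && modelName.startsWith("ai-hub-models/"))

// Hypothetical pre-flight check: returns an error message, or null when the input looks fine.
fun chipsetError(modelName: String, hub: Hub, chipset: String?): String? =
    if (targetsAiHub(modelName, hub) && chipset.isNullOrBlank()) {
        "Qualcomm AI Hub pulls on Android need a chipset, e.g. \"SM8750\""
    } else {
        null
    }
```

Calling chipsetError before building ModelPullInput lets the UI report the problem immediately rather than waiting for a PullEvent.Error.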

Data structures

LlmCreateInput

data class LlmCreateInput(
    val model_name:     String,
    val model_path:     String,
    val tokenizer_path: String? = null,
    val config:         ModelConfig,
    val runtime_id:     String? = null,
    val compute_unit:   String? = null,
)

VlmCreateInput

data class VlmCreateInput(
    val model_name:  String,
    val model_path:  String,
    val mmproj_path: String? = null,     // vision projection weights (GGUF VLMs)
    val config:      ModelConfig,
    val runtime_id:  String? = null,
    val compute_unit: String? = null,
)

ModelConfig

data class ModelConfig(
    var nCtx:                  Int     = 2048,     // context size; 0 = model default
    var nThreads:              Int     = 8,
    var nThreadsBatch:         Int     = 8,
    var nBatch:                Int     = 2048,
    var nUBatch:               Int     = 512,
    var nSeqMax:               Int     = 1,
    var nGpuLayers:            Int     = 0,
    val chat_template_path:    String  = "",
    val chat_template_content: String  = "",
    val max_tokens:            Int     = 2048,
    val enable_thinking:       Boolean = false,
    val verbose:               Boolean = false,
)
nGpuLayers is rewritten by the JNI layer based on compute_unit: cpu forces 0 and npu forces 999. For gpu the value you pass is used as-is, so the default of 0 offloads nothing; set 999 to offload all layers.
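The rewrite rule can be summarized in a few lines of Kotlin. This is a sketch of the documented behavior only: effectiveGpuLayers is a hypothetical name, and treating a null compute_unit as npu before the rewrite is an assumption based on the alias table.

```kotlin
// Sketch only: restates the documented nGpuLayers rewrite; not the JNI code itself.
fun effectiveGpuLayers(computeUnit: String?, requested: Int): Int =
    when (computeUnit ?: "npu") {      // assumption: null resolves to npu first
        "cpu" -> 0                     // pure CPU never offloads layers
        "npu" -> 999                   // NPU always offloads every layer
        "gpu" -> requested             // GPU keeps the caller's value as-is
        else -> throw IllegalArgumentException("Unknown compute unit: $computeUnit")
    }
```

Note that effectiveGpuLayers("gpu", 0) stays at 0, which is why a GPU configuration built from ModelConfig defaults would still run entirely on the CPU.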

ChatMessage

data class ChatMessage(
    var role:    String,   // "system" | "user" | "assistant"
    var content: String,
)

VlmChatMessage / VlmContent

data class VlmChatMessage(
    val role:     String?,                 // "system" | "user" | "assistant"
    val contents: List<VlmContent>,
)

data class VlmContent(
    val type: String?,                     // "text" | "image"
    val text: String?,                     // text content, or absolute file path for image
)
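Since an image entry carries its file path in the text field, a small builder can keep the two content kinds from being mixed up. The sketch below uses local stand-in types (Content, Message) that mirror the shapes above, and userImageMessage is a hypothetical helper. The absolute-path check simply tests for a leading "/", which is an assumption suited to Android paths.

```kotlin
// Local stand-ins mirroring VlmContent / VlmChatMessage so the sketch compiles on its own.
data class Content(val type: String?, val text: String?)
data class Message(val role: String?, val contents: List<Content>)

// Hypothetical helper: builds a user turn with one image followed by a text prompt.
fun userImageMessage(imagePath: String, prompt: String): Message {
    // Image entries are documented to take an absolute file path in `text`.
    require(imagePath.startsWith("/")) { "Image path must be absolute: $imagePath" }
    return Message(
        role = "user",
        contents = listOf(
            Content("image", imagePath),
            Content("text", prompt),
        ),
    )
}
```

With the real SDK types, the same shape would be built from VlmChatMessage and VlmContent directly.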

GenerationConfig

data class GenerationConfig(
    var maxTokens:     Int              = 32,
    var stopWords:     Array<String>?   = null,
    var stopCount:     Int              = 0,
    var nPast:         Int              = 0,
    var samplerConfig: SamplerConfig?   = null,
    var imagePaths:    Array<String>?   = null,
    var imageCount:    Int              = 0,
    var audioPaths:    Array<String>?   = null,
    var audioCount:    Int              = 0,
)
The default maxTokens is 32. Most use cases should set a higher value (e.g. maxTokens = 2048).

LlmStreamResult

sealed class LlmStreamResult {
    data class Token(val text: String)              : LlmStreamResult()
    data class Completed(val profile: ProfilingData) : LlmStreamResult()
    data class Error(val throwable: Throwable)      : LlmStreamResult()
}
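A stream of these results is typically folded into a single reply string. The sketch below shows one way to do that, using a local stand-in sealed class (StreamResult) so that it runs without the SDK. The stand-in Completed omits the profiling payload, and collectText is a hypothetical helper that takes a plain Iterable rather than the SDK's Flow.

```kotlin
// Local stand-in mirroring LlmStreamResult; Completed drops ProfilingData for brevity.
sealed class StreamResult {
    data class Token(val text: String) : StreamResult()
    object Completed : StreamResult()
    data class Error(val throwable: Throwable) : StreamResult()
}

// Hypothetical helper: concatenates tokens until the stream completes or fails.
fun collectText(results: Iterable<StreamResult>): Result<String> {
    val reply = StringBuilder()
    for (result in results) {
        when (result) {
            is StreamResult.Token -> reply.append(result.text)
            is StreamResult.Completed -> return Result.success(reply.toString())
            is StreamResult.Error -> return Result.failure(result.throwable)
        }
    }
    // Assumption: a stream that ends without Completed still yields the partial text.
    return Result.success(reply.toString())
}
```

In an app, the same when branches would sit inside the collect block of generateStreamFlow, appending each Token to UI state instead of a StringBuilder.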

llama.cpp (GGUF models)

Runs any GGUF model on CPU, Adreno GPU, or Hexagon NPU. Compute-unit selection is controlled by compute_unit.

LLM

val paths = ModelManagerWrapper.getPaths("unsloth/Qwen3-0.6B-GGUF")
    ?: error("Model not downloaded")

LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = paths.model_name,
            model_path = paths.model_path,
            config     = ModelConfig(nCtx = 4096),
            runtime_id  = "llama_cpp",
            compute_unit  = null,       // null → npu (recommended on Snapdragon)
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { println("Error: ${it.message}") }

val chat = arrayListOf(ChatMessage("user", "What is AI?"))

llmWrapper.applyChatTemplate(chat.toTypedArray(), null, false).onSuccess { t ->
    llmWrapper.generateStreamFlow(t.formattedText, GenerationConfig(maxTokens = 2048)).collect { result ->
        when (result) {
            is LlmStreamResult.Token     -> print(result.text)
            is LlmStreamResult.Completed -> println("\nDone")
            is LlmStreamResult.Error     -> println("Error: ${result.throwable}")
        }
    }
}

Compute-unit variants

Goal | compute_unit | Notes
Snapdragon NPU (recommended) | "npu" or null | Hexagon NPU acceleration.
Adreno GPU (OpenCL) | "gpu" | Set nGpuLayers = 999 to offload all layers.
Pure CPU | "cpu" | Works on any ARM64 chipset.

VLM

GGUF VLMs need two artifacts: the LLM weights (model_path) and the vision projection (mmproj_path). Both come from getPaths():
val paths = ModelManagerWrapper.getPaths("unsloth/Qwen3-VL-2B-Instruct-GGUF")
    ?: error("Model not downloaded")

VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name  = paths.model_name,
            model_path  = paths.model_path,
            mmproj_path = paths.mmproj_path,
            config      = ModelConfig(nCtx = 4096),
            runtime_id   = "llama_cpp",
            compute_unit   = null,
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }

val msg = VlmChatMessage(
    role     = "user",
    contents = listOf(
        VlmContent("image", "/storage/emulated/0/Pictures/example.jpg"),
        VlmContent("text",  "Describe this image."),
    ),
)
val chat = arrayListOf(msg)

vlmWrapper.applyChatTemplate(chat.toTypedArray(), null, false).onSuccess { t ->
    val gen = vlmWrapper.injectMediaPathsToConfig(chat.toTypedArray(), GenerationConfig(maxTokens = 2048))
    vlmWrapper.generateStreamFlow(t.formattedText, gen).collect { result ->
        when (result) {
            is LlmStreamResult.Token     -> print(result.text)
            is LlmStreamResult.Completed -> println("\nDone")
            is LlmStreamResult.Error     -> println("Error: ${result.throwable}")
        }
    }
}
Always pass t.formattedText (the chat-templated prompt) into generateStreamFlow, not the raw user text. The native pipeline treats the prompt as already-templated.

Qualcomm® AI Hub Models (NPU via Qualcomm AI Engine Direct)

Pre-compiled models from Qualcomm AI Hub. NPU-only, pinned to a specific chipset (SM8750 = Snapdragon 8 Elite, SM8850 = Snapdragon 8 Elite Gen 5).

Downloading Qualcomm AI Hub Models

ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "ai-hub-models/Qwen2.5-VL-7B-Instruct",
        hub        = HubSource.AUTO,     // AUTO routes `ai-hub-models/*` to Qualcomm AI Hub
        chipset    = "SM8750",           // REQUIRED on Android
    )
).collect { event ->
    when (event) {
        is ModelManagerWrapper.PullEvent.Progress  -> updateProgressBar(event.files)
        ModelManagerWrapper.PullEvent.Completed    -> println("done")
        is ModelManagerWrapper.PullEvent.Error     -> println("err ${event.code}: ${event.message}")
    }
}

Supported models

Modality | Hub repo
LLM | ai-hub-models/Qwen3-4B-Instruct-2507
VLM | ai-hub-models/Qwen2.5-VL-7B-Instruct

LLM

val paths = ModelManagerWrapper.getPaths("ai-hub-models/Qwen3-4B-Instruct-2507")
    ?: error("Model not downloaded")

LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = paths.model_name,
            model_path = paths.model_path,
            config     = ModelConfig(max_tokens = 2048, enable_thinking = false),
            runtime_id  = "qairt",
            compute_unit  = null,           // null → NPU (only option for Qualcomm AI Engine Direct)
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { println("Error: ${it.message}") }

val chat = arrayListOf(ChatMessage("user", "What is AI?"))

llmWrapper.applyChatTemplate(chat.toTypedArray(), null, false).onSuccess { t ->
    llmWrapper.generateStreamFlow(t.formattedText, GenerationConfig(maxTokens = 2048)).collect { result ->
        when (result) {
            is LlmStreamResult.Token     -> print(result.text)
            is LlmStreamResult.Completed -> println("\nDone")
            is LlmStreamResult.Error     -> println("Error: ${result.throwable}")
        }
    }
}
Qualcomm AI Engine Direct rejects nGpuLayers != 0 and nCtx != 0 with PARAM_NOT_SUPPORTED — the KV cache and context length are fixed at compile time by the Qualcomm AI Hub bundle. Leave both at defaults and use max_tokens / enable_thinking only.

VLM

val paths = ModelManagerWrapper.getPaths("ai-hub-models/Qwen2.5-VL-7B-Instruct")
    ?: error("Model not downloaded")

VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name  = paths.model_name,
            model_path  = paths.model_path,
            mmproj_path = paths.mmproj_path,
            config      = ModelConfig(max_tokens = 2048, enable_thinking = false),
            runtime_id   = "qairt",
            compute_unit   = null,
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }

val msg = VlmChatMessage(
    role     = "user",
    contents = listOf(
        VlmContent("image", "/storage/emulated/0/Pictures/cat.jpg"),
        VlmContent("text",  "What's in this image?"),
    ),
)
val chat = arrayListOf(msg)

vlmWrapper.applyChatTemplate(chat.toTypedArray(), null, false).onSuccess { t ->
    val gen = vlmWrapper.injectMediaPathsToConfig(chat.toTypedArray(), GenerationConfig(maxTokens = 2048))
    vlmWrapper.generateStreamFlow(t.formattedText, gen).collect { result ->
        when (result) {
            is LlmStreamResult.Token     -> print(result.text)
            is LlmStreamResult.Completed -> println("\nDone")
            is LlmStreamResult.Error     -> println("Error: ${result.throwable}")
        }
    }
}
Pass the chat-templated prompt (t.formattedText) to generateStreamFlow, never raw user text. Qualcomm AI Engine Direct VLM treats its prompt as already-templated — raw text produces degenerate output.
