
Runtime and Compute Unit Selection

Runtime

Select the inference runtime with runtime_id:
val runtime_id: String?   // "llama_cpp" | "qairt" | null
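| runtime_id | Runtime | Model format | Compute units |
|---|---|---|---|
| "llama_cpp" | llama.cpp + GGML Hexagon backend | GGUF | CPU / Adreno GPU / Hexagon NPU |
| "qairt" | Qualcomm® AI Engine Direct | Qualcomm AI Hub precompiled bin | Hexagon NPU only |
| null | The SDK infers the runtime from the model path | | |

The constants are defined in RuntimeIdValue (LLAMA_CPP, QAIRT).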

Compute Unit

The compute unit alias passed to geniex_resolve_device in the native SDK.
val compute_unit: String?   // "cpu" | "gpu" | "npu" | null
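| Alias | Behavior |
|---|---|
| null | Runtime default: npu for both llama_cpp and qairt. |
| "npu" | Hexagon NPU acceleration. Recommended on Snapdragon. |
| "gpu" | Adreno GPU via OpenCL (llama_cpp only). |
| "cpu" | CPU only. Forces nGpuLayers = 0. |

Qualcomm AI Engine Direct supports the NPU only. Passing "cpu" or "gpu" for a Qualcomm AI Hub model only logs a warning and falls back to the NPU; it does not raise an error.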
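
The alias table and the fallback note can be restated as a small pure function. This is only a sketch of the documented behavior: resolveComputeUnit is a hypothetical helper, not an SDK call, and the warning text is made up for illustration.

```kotlin
// Illustrative only: mirrors the alias table and the fallback note above.
// resolveComputeUnit is a hypothetical helper, not part of the SDK.
fun resolveComputeUnit(runtimeId: String, computeUnit: String?): String {
    // null means "runtime default", which is npu for both runtimes.
    val requested = computeUnit ?: "npu"
    require(requested in setOf("cpu", "gpu", "npu")) { "Unknown compute unit: $requested" }

    // Qualcomm AI Engine Direct is NPU-only: warn and fall back instead of failing.
    if (runtimeId == "qairt" && requested != "npu") {
        System.err.println("warning: qairt ignores compute_unit=\"$requested\", using npu")
        return "npu"
    }
    return requested
}

fun main() {
    println(resolveComputeUnit("llama_cpp", null))  // npu
    println(resolveComputeUnit("llama_cpp", "gpu")) // gpu
    println(resolveComputeUnit("qairt", "cpu"))     // npu (after a warning on stderr)
}
```

A helper like this can be useful in a settings screen, to show the user which compute unit will actually be used before the model is loaded.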

Model Manager

Models are pulled onto the device by the built-in Rust model manager. Do not adb push weights by hand; use ModelManagerWrapper.

ModelManagerWrapper

// Init (idempotent — safe to call on every Activity.onCreate).
GenieXSdk.getInstance().init(context)

// Pull with streaming progress.
ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "unsloth/Qwen3-0.6B-GGUF",
        precision  = "Q4_0",
        hub        = HubSource.HUGGINGFACE,
    )
).collect { event ->
    when (event) {
        is ModelManagerWrapper.PullEvent.Progress  -> { /* update UI */ }
        ModelManagerWrapper.PullEvent.Completed    -> { /* done */ }
        is ModelManagerWrapper.PullEvent.Error     -> { /* show error */ }
    }
}

// Resolve on-disk paths for a previously-pulled model.
val paths: ModelPaths? = ModelManagerWrapper.getPaths("unsloth/Qwen3-0.6B-GGUF")

// Inventory / cleanup.
ModelManagerWrapper.list()              // List<String>
ModelManagerWrapper.remove("org/repo") // 0 = success
ModelManagerWrapper.clean()            // wipe all cached models

ModelPullInput

data class ModelPullInput(
    val model_name:   String,                     // "org/repo" or alias
    val precision:    String? = null,             // precision (quantization) e.g. "Q4_0", "Q4_K_M"
    val hub:          HubSource = HubSource.AUTO, // AUTO routes by model_name
    val local_path:   String? = null,             // only when hub == LOCALFS
    val hf_token:     String? = null,             // falls back to GENIEX_HFTOKEN env
    val chipset:      String? = null,             // required for Qualcomm AI Hub on Android (e.g. "SM8750")
    val display_name: String? = null,
)

HubSource

enum class HubSource(val value: Int) {
    AUTO(0),          // routes by prefix (e.g. ai-hub-models/* → AIHUB)
    HUGGINGFACE(1),
    MODELSCOPE(2),
    AIHUB(3),
    VOLCES(4),
    LOCALFS(127),
}

ModelPaths

Returned by getPaths(). Its fields can be passed directly to LlmCreateInput / VlmCreateInput.
data class ModelPaths(
    val model_path:     String,
    val model_dir:      String,
    val model_name:     String,
    val runtime_id:      String,            // authoritative — prefer over UI selection
    val mmproj_path:    String? = null,    // VLM projection weights
    val tokenizer_path: String? = null,
    val compute_unit:      String? = null,
)
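On Android, chipset must be passed explicitly when pulling Qualcomm AI Hub models. The Rust side detects it automatically only on Snapdragon Windows. Use "SM8750" for Snapdragon 8 Elite and "SM8850" for Snapdragon 8 Elite Gen 5.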
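
Since a missing chipset only surfaces once the pull starts, it can help to check the request up front. The sketch below uses local stand-in types (Hub, PullRequest) rather than the SDK's ModelPullInput and HubSource, and assumes that AUTO routes any name starting with ai-hub-models/ to Qualcomm AI Hub, as the HubSource comment suggests.

```kotlin
// Local stand-ins for illustration; the real types come from the SDK.
enum class Hub { AUTO, HUGGINGFACE, MODELSCOPE, AIHUB, VOLCES, LOCALFS }

data class PullRequest(
    val modelName: String,
    val hub: Hub = Hub.AUTO,
    val chipset: String? = null,
)

// Chipset IDs quoted in this section.
val CHIPSETS = mapOf(
    "Snapdragon 8 Elite" to "SM8750",
    "Snapdragon 8 Elite Gen 5" to "SM8850",
)

// True when the request targets Qualcomm AI Hub, explicitly or via AUTO prefix routing.
fun targetsAiHub(req: PullRequest): Boolean =
    req.hub == Hub.AIHUB ||
        (req.hub == Hub.AUTO && req.modelName.startsWith("ai-hub-models/"))

// Returns an error message for a missing chipset, or null when the request looks fine.
fun checkChipset(req: PullRequest): String? =
    if (targetsAiHub(req) && req.chipset.isNullOrBlank()) {
        "chipset is required for Qualcomm AI Hub pulls on Android (e.g. \"SM8750\")"
    } else {
        null
    }

fun main() {
    val ok = PullRequest(
        "ai-hub-models/Qwen3-4B-Instruct-2507",
        chipset = CHIPSETS["Snapdragon 8 Elite"],
    )
    val missing = PullRequest("ai-hub-models/Qwen3-4B-Instruct-2507")
    println(checkChipset(ok))      // null
    println(checkChipset(missing)) // the error message
}
```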

Data Structures

LlmCreateInput

data class LlmCreateInput(
    val model_name:     String,
    val model_path:     String,
    val tokenizer_path: String? = null,
    val config:         ModelConfig,
    val runtime_id:      String? = null,
    val compute_unit:      String? = null,
)

VlmCreateInput

data class VlmCreateInput(
    val model_name:  String,
    val model_path:  String,
    val mmproj_path: String? = null,     // vision projection weights (GGUF VLMs)
    val config:      ModelConfig,
    val runtime_id:   String? = null,
    val compute_unit:   String? = null,
)

ModelConfig

data class ModelConfig(
    var nCtx:                  Int     = 2048,     // context size; 0 = model default
    var nThreads:              Int     = 8,
    var nThreadsBatch:         Int     = 8,
    var nBatch:                Int     = 2048,
    var nUBatch:               Int     = 512,
    var nSeqMax:               Int     = 1,
    var nGpuLayers:            Int     = 0,
    val chat_template_path:    String  = "",
    val chat_template_content: String  = "",
    val max_tokens:            Int     = 2048,
    val enable_thinking:       Boolean = false,
    val verbose:               Boolean = false,
)
JNI rewrites nGpuLayers based on compute_unit: cpu forces it to 0 and npu forces it to 999. gpu keeps the value you pass in, so set it to 999 to offload all layers.
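
The rewrite rule is easy to get wrong for gpu, because the ModelConfig default of nGpuLayers = 0 is kept as-is. The function below restates the rule as a sketch; effectiveGpuLayers is a hypothetical name, and treating null like npu is an assumption based on the compute unit defaults described earlier.

```kotlin
// Hypothetical helper restating the rewrite rule above.
// The real logic lives in the SDK's JNI layer.
fun effectiveGpuLayers(computeUnit: String?, requested: Int): Int = when (computeUnit) {
    "cpu" -> 0           // CPU-only: nothing is offloaded
    "npu", null -> 999   // assumption: null resolves to the npu default
    "gpu" -> requested   // GPU keeps the caller's value
    else -> throw IllegalArgumentException("Unknown compute unit: $computeUnit")
}

fun main() {
    println(effectiveGpuLayers("cpu", 999)) // 0
    println(effectiveGpuLayers("npu", 0))   // 999
    println(effectiveGpuLayers("gpu", 0))   // 0: the default offloads nothing
    println(effectiveGpuLayers("gpu", 999)) // 999: full offload
}
```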

ChatMessage

data class ChatMessage(
    var role:    String,   // "system" | "user" | "assistant"
    var content: String,
)

VlmChatMessage / VlmContent

data class VlmChatMessage(
    val role:     String?,                 // "system" | "user" | "assistant"
    val contents: List<VlmContent>,
)

data class VlmContent(
    val type: String?,                     // "text" | "image"
    val text: String?,                     // text content, or absolute file path for image
)

GenerationConfig

data class GenerationConfig(
    var maxTokens:     Int              = 32,
    var stopWords:     Array<String>?   = null,
    var stopCount:     Int              = 0,
    var nPast:         Int              = 0,
    var samplerConfig: SamplerConfig?   = null,
    var imagePaths:    Array<String>?   = null,
    var imageCount:    Int              = 0,
    var audioPaths:    Array<String>?   = null,
    var audioCount:    Int              = 0,
)
maxTokens defaults to 32. Most use cases should set a larger value (for example, maxTokens = 2048).
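
The imagePaths / imageCount fields pair with VlmContent entries of type "image", whose text field holds the file path. The VLM examples on this page fill them through injectMediaPathsToConfig, whose internals are not described here. As an assumption about what such a step involves, gathering the paths by hand could look like the sketch below, which uses local stand-in types (Content, Message) instead of the SDK classes.

```kotlin
// Local stand-ins mirroring VlmContent / VlmChatMessage; not the SDK classes.
data class Content(val type: String?, val text: String?)
data class Message(val role: String?, val contents: List<Content>)

// Collects the file paths of all "image" entries, in message order.
fun imagePathsOf(chat: List<Message>): Array<String> =
    chat.flatMap { it.contents }
        .filter { it.type == "image" }
        .mapNotNull { it.text }
        .toTypedArray()

fun main() {
    val chat = listOf(
        Message(
            "user",
            listOf(
                Content("image", "/storage/emulated/0/Pictures/example.jpg"),
                Content("text", "Describe this image."),
            ),
        ),
    )
    val paths = imagePathsOf(chat)
    println("${paths.size} image(s): ${paths.joinToString()}")
}
```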

LlmStreamResult

sealed class LlmStreamResult {
    data class Token(val text: String)              : LlmStreamResult()
    data class Completed(val profile: ProfilingData) : LlmStreamResult()
    data class Error(val throwable: Throwable)      : LlmStreamResult()
}
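
When the full reply is needed as one string instead of token by token, the three result types can be folded into a single value. The sketch below uses a local stand-in (StreamResult) that drops ProfilingData, and a plain Iterable in place of the Flow the SDK returns, so that it runs without coroutine dependencies; collectText is a hypothetical helper.

```kotlin
// Local stand-in for the sealed class above; ProfilingData is omitted
// to keep the sketch self-contained.
sealed class StreamResult {
    data class Token(val text: String) : StreamResult()
    object Completed : StreamResult()
    data class Error(val throwable: Throwable) : StreamResult()
}

// Concatenates tokens until Completed; surfaces the first Error as a failed Result.
fun collectText(events: Iterable<StreamResult>): Result<String> {
    val sb = StringBuilder()
    for (event in events) {
        when (event) {
            is StreamResult.Token -> sb.append(event.text)
            is StreamResult.Completed -> return Result.success(sb.toString())
            is StreamResult.Error -> return Result.failure(event.throwable)
        }
    }
    // Stream ended without Completed: return what was received so far.
    return Result.success(sb.toString())
}

fun main() {
    val events = listOf(
        StreamResult.Token("Hello"),
        StreamResult.Token(", world"),
        StreamResult.Completed,
    )
    println(collectText(events).getOrNull()) // Hello, world
}
```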

llama.cpp (GGUF Models)

Runs any GGUF model on the CPU, Adreno GPU, or Hexagon NPU. The compute unit is selected with compute_unit.

LLM

val paths = ModelManagerWrapper.getPaths("unsloth/Qwen3-0.6B-GGUF")
    ?: error("Model not downloaded")

LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = paths.model_name,
            model_path = paths.model_path,
            config     = ModelConfig(nCtx = 4096),
            runtime_id  = "llama_cpp",
            compute_unit  = null,       // null → npu (recommended on Snapdragon)
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { println("Error: ${it.message}") }

val chat = arrayListOf(ChatMessage("user", "What is AI?"))

llmWrapper.applyChatTemplate(chat.toTypedArray(), null, false).onSuccess { t ->
    llmWrapper.generateStreamFlow(t.formattedText, GenerationConfig(maxTokens = 2048)).collect { result ->
        when (result) {
            is LlmStreamResult.Token     -> print(result.text)
            is LlmStreamResult.Completed -> println("\nDone")
            is LlmStreamResult.Error     -> println("Error: ${result.throwable}")
        }
    }
}

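Compute Unit Variants

| Target | compute_unit | Notes |
|---|---|---|
| Snapdragon NPU (recommended) | "npu" or null | Hexagon NPU acceleration. |
| Adreno GPU (OpenCL) | "gpu" | Set nGpuLayers = 999 to offload all layers. |
| CPU only | "cpu" | Works on any ARM64 chip. |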

VLM

A GGUF VLM needs two artifacts: the LLM weights (model_path) and the vision projection (mmproj_path). Both come from getPaths().
val paths = ModelManagerWrapper.getPaths("unsloth/Qwen3-VL-2B-Instruct-GGUF")
    ?: error("Model not downloaded")

VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name  = paths.model_name,
            model_path  = paths.model_path,
            mmproj_path = paths.mmproj_path,
            config      = ModelConfig(nCtx = 4096),
            runtime_id   = "llama_cpp",
            compute_unit   = null,
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }

val msg = VlmChatMessage(
    role     = "user",
    contents = listOf(
        VlmContent("image", "/storage/emulated/0/Pictures/example.jpg"),
        VlmContent("text",  "Describe this image."),
    ),
)
val chat = arrayListOf(msg)

vlmWrapper.applyChatTemplate(chat.toTypedArray(), null, false).onSuccess { t ->
    val gen = vlmWrapper.injectMediaPathsToConfig(chat.toTypedArray(), GenerationConfig(maxTokens = 2048))
    vlmWrapper.generateStreamFlow(t.formattedText, gen).collect { result ->
        when (result) {
            is LlmStreamResult.Token     -> print(result.text)
            is LlmStreamResult.Completed -> println("\nDone")
            is LlmStreamResult.Error     -> println("Error: ${result.throwable}")
        }
    }
}
Always pass t.formattedText (the chat-templated prompt) to generateStreamFlow, not the raw user text. The native pipeline treats the prompt as already formatted.

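Qualcomm® AI Hub Models (NPU via Qualcomm AI Engine Direct)

Precompiled models from Qualcomm AI Hub. They run on the NPU only and are bound to a specific chipset (SM8750 = Snapdragon 8 Elite, SM8850 = Snapdragon 8 Elite Gen 5).

Downloading a Qualcomm AI Hub Model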

ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "ai-hub-models/Qwen2.5-VL-7B-Instruct",
        hub        = HubSource.AUTO,     // AUTO routes `ai-hub-models/*` to Qualcomm AI Hub
        chipset    = "SM8750",           // REQUIRED on Android
    )
).collect { event ->
    when (event) {
        is ModelManagerWrapper.PullEvent.Progress  -> updateProgressBar(event.files)
        ModelManagerWrapper.PullEvent.Completed    -> println("done")
        is ModelManagerWrapper.PullEvent.Error     -> println("err ${event.code}: ${event.message}")
    }
}

Supported Models

| Modality | Hub repository |
|---|---|
| LLM | ai-hub-models/Qwen3-4B-Instruct-2507 |
| VLM | ai-hub-models/Qwen2.5-VL-7B-Instruct |

LLM

val paths = ModelManagerWrapper.getPaths("ai-hub-models/Qwen3-4B-Instruct-2507")
    ?: error("Model not downloaded")

LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = paths.model_name,
            model_path = paths.model_path,
            config     = ModelConfig(max_tokens = 2048, enable_thinking = false),
            runtime_id  = "qairt",
            compute_unit  = null,           // null → NPU (only option for Qualcomm AI Engine Direct)
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { println("Error: ${it.message}") }

val chat = arrayListOf(ChatMessage("user", "What is AI?"))

llmWrapper.applyChatTemplate(chat.toTypedArray(), null, false).onSuccess { t ->
    llmWrapper.generateStreamFlow(t.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token     -> print(result.text)
            is LlmStreamResult.Completed -> println("\nDone")
            is LlmStreamResult.Error     -> println("Error: ${result.throwable}")
        }
    }
}
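Qualcomm AI Engine Direct rejects nGpuLayers != 0 and nCtx != 0 with PARAM_NOT_SUPPORTED, because the KV cache and context length are fixed at compile time by the Qualcomm AI Hub model package. Keep both at their defaults and adjust only max_tokens / enable_thinking.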

VLM

val paths = ModelManagerWrapper.getPaths("ai-hub-models/Qwen2.5-VL-7B-Instruct")
    ?: error("Model not downloaded")

VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name  = paths.model_name,
            model_path  = paths.model_path,
            mmproj_path = paths.mmproj_path,
            config      = ModelConfig(max_tokens = 2048, enable_thinking = false),
            runtime_id   = "qairt",
            compute_unit   = null,
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }

val msg = VlmChatMessage(
    role     = "user",
    contents = listOf(
        VlmContent("image", "/storage/emulated/0/Pictures/cat.jpg"),
        VlmContent("text",  "What's in this image?"),
    ),
)
val chat = arrayListOf(msg)

vlmWrapper.applyChatTemplate(chat.toTypedArray(), null, false).onSuccess { t ->
    val gen = vlmWrapper.injectMediaPathsToConfig(chat.toTypedArray(), GenerationConfig(maxTokens = 2048))
    vlmWrapper.generateStreamFlow(t.formattedText, gen).collect { result ->
        when (result) {
            is LlmStreamResult.Token     -> print(result.text)
            is LlmStreamResult.Completed -> println("\nDone")
            is LlmStreamResult.Error     -> println("Error: ${result.throwable}")
        }
    }
}
Pass the chat-templated prompt (t.formattedText) to generateStreamFlow, never the raw user text. The Qualcomm AI Engine Direct VLM treats the prompt as already formatted, so raw text produces degraded output.
