> ## Documentation Index
> Fetch the complete documentation index at: https://geniex.aihub.qualcomm.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Models

> Where to get models, how to run them, and which precisions land on the Snapdragon NPU.

Models come from two sources, one for each of the two GenieX [runtimes](/cn/get-started/platforms#geniex-运行环境):

* **[Qualcomm AI Hub](https://aihub.qualcomm.com/models/)**: precompiled model bundles for Qualcomm AI Engine Direct, plus curated GGUF models for llama.cpp.
* **[Hugging Face](https://huggingface.co/models?library=gguf)**: any GGUF model that llama.cpp can run. See [Run GGUF models from Hugging Face](#run-gguf-models-from-hugging-face).

## **Run Qualcomm AI Hub models**

For models under the `ai-hub-models/*` namespace, pass the model name straight to `geniex infer`:

```powershell windows theme={"dark"}
geniex infer ai-hub-models/Qwen3-4B
```

Qualcomm AI Engine Direct runs only on the NPU. Pass `--compute npu` or omit the flag.

## **Run GGUF models from Hugging Face**

Any compatible GGUF repository on Hugging Face works. Copy the repository path and pass it to `geniex infer`:

```powershell windows theme={"dark"}
geniex infer /-GGUF
```

For example:

```powershell windows theme={"dark"}
geniex infer unsloth/Phi-4-mini-instruct-GGUF
```

When prompted, choose:

* **Model type**: `vlm` for vision-language models (the Qwen3-VL family), `llm` for everything else.
* **Precision (quantization)**: `Q4_0` has the best support on the Hexagon NPU. See [Supported precisions (quantization)](#supported-precisions-quantization).

### Set a Hugging Face token

Some models on Hugging Face are **gated**: you must accept the model's license agreement on the Hugging Face website and provide an access token before GenieX can download them. Go to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and create a new token with **Read** permission. Then visit the model's Hugging Face page (for example `https://huggingface.co//`) and accept the license or agreement if prompted.

GenieX checks the following token sources in order and uses the first non-empty value:

1. The `GENIEX_HFTOKEN` environment variable
2. The `HF_TOKEN` environment variable
3. The `~/.cache/huggingface/token` file

**Option A: environment variable (recommended)**

```powershell windows theme={"dark"}
# Standard HF variable (also works with other HF tools)
$env:HF_TOKEN = "hf_..."

# Or use the GenieX-specific variable (takes precedence over HF_TOKEN)
$env:GENIEX_HFTOKEN = "hf_..."
```

```bash linux theme={"dark"}
# Standard HF variable (also works with other HF tools)
export HF_TOKEN="hf_..."

# Or use the GenieX-specific variable (takes precedence over HF_TOKEN)
export GENIEX_HFTOKEN="hf_..."
```

**Option B: Hugging Face CLI login (persisted to disk)**

```bash theme={"dark"}
pip install huggingface_hub
huggingface-cli login
```

This command writes the token to `~/.cache/huggingface/token`, which GenieX reads automatically.

Because `GENIEX_HFTOKEN` takes precedence over `HF_TOKEN` and the cached token file, use it when GenieX needs its own token and other Hugging Face tools should stay unaffected.

## **Run a local Qualcomm AI Engine Direct model bundle**

Model bundles outside the `ai-hub-models/*` namespace, whether you converted them yourself from Hugging Face or they already sit on disk, need to be imported only once. After that they run like any other model. Register the bundle with `geniex pull --local-path`, then run it with `geniex infer`.

**Self-converted bundle from Hugging Face**

Make sure the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) is installed, then download the bundle and pull it:

```powershell windows theme={"dark"}
# Set the model name (must exactly match the HF subfolder name)
$model = "llama_v3_2_3b_instruct"

# Download the model from Hugging Face
hf download yichqian/geniex-qairt-models `
  --local-dir $model `
  --include "$model/**"

# Pull using the downloaded folder (absolute path)
geniex pull local/$model --local-path (Resolve-Path "$model\$model").Path

# Run inference
geniex infer local/$model
```

```bash linux theme={"dark"}
# Set the model name (must exactly match the HF subfolder name)
model="llama_v3_2_3b_instruct"

# Download the model from Hugging Face
hf download yichqian/geniex-qairt-models \
  --local-dir "$model" \
  --include "$model/**"

# Pull using the downloaded folder (absolute path)
geniex pull "local/$model" --local-path "$(realpath "$model/$model")"

# Run inference
geniex infer "local/$model"
```

**Already on disk**

Point `--local-path` at the extracted bundle directory (the one containing the `.bin` shards and `metadata.json`):

```powershell windows theme={"dark"}
geniex pull local/my-bundle --local-path C:\models\my-bundle
geniex infer local/my-bundle
```

```bash linux theme={"dark"}
geniex pull local/my-bundle --local-path /home/user/models/my-bundle
geniex infer local/my-bundle
```

You can also pull directly from an AI Hub `.zip` archive without extracting it first:

```powershell windows theme={"dark"}
geniex pull local/my-bundle --local-path C:\downloads\model.zip
```

```bash linux theme={"dark"}
geniex pull local/my-bundle --local-path /home/user/downloads/model.zip
```

`geniex pull` copies the model files into the local cache. After a successful pull, you can safely delete the original download to free disk space. Use `geniex list` to confirm the model is cached.

From Python, point `from_pretrained` at the bundle directory and force the Qualcomm AI Engine Direct runtime with `device_map="qairt"`:

```python theme={"dark"}
from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    r"C:\Qwen3-4B-Instruct-2507",         # any .bin file or bundle folder
    model_name="qwen3_4b_instruct_2507",  # matches the Qualcomm AI Hub model id
    device_map="qairt",                   # required for Qualcomm AI Engine Direct
)
```

See the [Python API reference](/cn/run/python/api-reference) for the full parameter list.

From Kotlin, set `hub = HubSource.LOCALFS` and point `local_path` at the extracted AI Hub directory (`metadata.json` plus one or more `.bin` shards) or at an AI Hub `.zip`. `pullFlow` imports it into the SDK cache without network access. You do not need to specify `chipset`, because the bundle is already compiled for one chip. The modality (LLM or VLM) is read from `metadata.json`.

```kotlin theme={"dark"}
ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "local/qwen3-4b-2507",
        hub = HubSource.LOCALFS,
        local_path = "/data/local/tmp/Qwen3-4B-Instruct-2507", // extracted directory or .zip
    )
).collect { /* Progress / Completed / Error */ }
```

Then load it with `runtime_id = "qairt"`, exactly as you would a downloaded model:

```kotlin theme={"dark"}
val paths = ModelManagerWrapper.getPaths("local/qwen3-4b-2507")
    ?: error("Model not imported")

val llm = LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = paths.model_name,
            model_path = paths.model_path,
            config = ModelConfig(max_tokens = 2048, enable_thinking = false),
            runtime_id = "qairt",
            compute_unit = null, // qairt supports NPU only
        )
    )
    .build()
    .getOrThrow()
```

`model_name` is only a cache key: any string of the form `org/repo` works, and the `local/...` prefix is a common convention. Local imports do not require you to write `geniex.json` yourself; the model manager generates it at import time. A QAIRT folder is recognized by `metadata.json` plus `.bin` shards (or a `.zip`). A directory holding only loose `.bin` files without `metadata.json` is not recognized. When importing models of mixed types, prefer `paths.runtime_id` (the authoritative value) over a hard-coded `runtime_id`.

## **Run a local GGUF model**

A local GGUF model is a directory (or file) holding `.gguf` weights. It may be sideloaded, produced by another tool, or already on disk. Import it once and it runs like any other model. Point `--local-path` at the directory containing the `.gguf` file:

```powershell windows theme={"dark"}
geniex pull local/my-model --local-path C:\models\my-model
geniex infer local/my-model
```

```bash linux theme={"dark"}
geniex pull local/my-model --local-path /home/user/models/my-model
geniex infer local/my-model
```

From Python, pass the path of the local `.gguf` file directly to `from_pretrained`:

```python theme={"dark"}
from geniex import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    r"C:\models\my-model\model.gguf",  # path to the local .gguf file
    device_map="auto",                 # auto -> llama_cpp uses hybrid
)
```

See the [Python API reference](/cn/run/python/api-reference) for the full parameter list.

From Kotlin, set `hub = HubSource.LOCALFS` and point `local_path` at the directory containing the `.gguf` file. If a `geniex.json` manifest is present, it is used; otherwise the layout is inferred from the file names. For a VLM, place `mmproj-*.gguf` in the same directory.

```kotlin theme={"dark"}
ModelManagerWrapper.pullFlow(
    ModelPullInput(
        model_name = "local/qwen3-0.6b",
        hub = HubSource.LOCALFS,
        local_path = "/data/local/tmp/qwen3-0.6b", // directory containing *.gguf
    )
).collect { /* Progress / Completed / Error */ }
```

Then load it with `runtime_id = "llama_cpp"`, exactly as you would a downloaded model:

```kotlin theme={"dark"}
val paths = ModelManagerWrapper.getPaths("local/qwen3-0.6b")
    ?: error("Model not imported")

val llm = LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = paths.model_name,
            model_path = paths.model_path,
            config = ModelConfig(nCtx = 4096),
            runtime_id = "llama_cpp",
            compute_unit = null, // null → NPU on Snapdragon (recommended)
        )
    )
    .build()
    .getOrThrow()
```

## **Supported precisions (quantization)**

The precision you choose determines which compute unit the model runs on.

### llama.cpp

llama.cpp supports all GGML quantization formats. When you run `geniex pull` on a GGUF model, the CLI prompts you to pick one:

```powershell theme={"dark"}
Choose a precision version to download
> Q4_0 [1.2 GiB] (default)
  Q8_0 [2.0 GiB]
  F16 [3.8 GiB]
```

| Precision (quantization) | Runs on | Notes |
| ------------------------ | --------------- | --------------------------------- |
| **`Q4_0`** *(default)* | **Hexagon NPU** | Best-supported llama.cpp precision on the NPU. Recommended for most models. |
| `Q8_0` | GPU / CPU | Better quality; roughly 2x the disk and GPU memory footprint. |
| `F16` | GPU / CPU | Reference precision. Mainly for evaluation; larger and slower. |
| `Q4_K_M`, `Q5_K_M`, etc. | GPU / CPU | Mixed-precision K-quants. Not optimized for the Hexagon NPU. |

If you want the model to land on the Hexagon NPU, stick with `Q4_0`. Other precisions also run, but usually fall back to the GPU or CPU.

### Qualcomm AI Engine Direct

Qualcomm AI Engine Direct model bundles are **statically pre-quantized**, so this runtime offers no precision choice. For a different precision, get another bundle from [Qualcomm AI Hub](https://aihub.qualcomm.com/models/).

| Precision (quantization) | Runs on | Notes |
| ------------------------ | --------------- | --------------------------------------- |
| **`w4a16`** *(most common)* | **Hexagon NPU** | int4 weights, int16 activations. Best balance of model size and accuracy for on-device LLMs. |
| `w4` | **Hexagon NPU** | int4 weights, floating-point activations. Slightly more accurate than w4a16, at a higher compute cost. |

Most precompiled bundles on Qualcomm AI Hub use `w4a16`. The quantization level is baked into the bundle at compile time, so choose the bundle that matches your quality and performance targets. For other bundle constraints (context length, KV cache, Android `nCtx`), see [Qualcomm AI Engine Direct runtime limitations](/cn/get-started/platforms#运行环境限制).
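The two precision tables in this section can be encoded as a small lookup, which may be handy when a script needs to predict where a model will run before pulling it. The sketch below is illustrative only: `LANDING` and `expected_target` are hypothetical names, not part of the GenieX API, and the function simply restates the documented tables without querying GenieX or the device.

```python
# Hypothetical helper: restates the documented precision tables.
# It does not call GenieX and cannot detect the actual hardware.
LANDING = {
    "llama_cpp": {
        "Q4_0": "Hexagon NPU",
        "Q8_0": "GPU / CPU",
        "F16": "GPU / CPU",
    },
    "qairt": {
        "w4a16": "Hexagon NPU",
        "w4": "Hexagon NPU",
    },
}


def expected_target(runtime: str, precision: str) -> str:
    """Return the compute unit the tables list for a runtime/precision pair."""
    if runtime not in LANDING:
        raise ValueError(f"unknown runtime: {runtime!r}")
    table = LANDING[runtime]
    if precision in table:
        return table[precision]
    if runtime == "llama_cpp":
        # K-quants and other GGML formats usually fall back to GPU or CPU.
        return "GPU / CPU"
    # QAIRT bundles are pre-quantized, so an unlisted precision is an error here.
    raise ValueError(f"no documented {runtime} bundle precision: {precision!r}")


if __name__ == "__main__":
    for runtime, precision in [("llama_cpp", "Q4_0"), ("llama_cpp", "Q5_K_M"), ("qairt", "w4a16")]:
        print(f"{runtime:10s} {precision:8s} -> {expected_target(runtime, precision)}")
```

The llama.cpp branch defaults unlisted precisions to GPU / CPU because the text says other precisions still run but usually fall back; the QAIRT branch raises instead, since that runtime has no precision choice beyond the bundle itself.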
