llama.cpp推理 - RWKV推理文档

ℹ️

llama.cpp (opens in a new tab) 是一个轻量化的大语言模型运行框架，专门优化了在 CPU 上运行模型的性能。

随着 RWKV 社区成员 @MollySophia (opens in a new tab) 的工作，llama.cpp 现已适配 RWKV-6 模型。

本章节介绍如何在 llama.cpp 中使用 RWKV-6 模型进行推理。

视频教程

ℹ️

高画质视频请跳转到 B 站 (opens in a new tab)观看。

llama.cpp 推理 RWKV 模型

本地构建 llama.cpp

可以选择从 llama.cpp 的 release 页面 (opens in a new tab)下载已编译的 llama.cpp 程序。

llama.cpp 提供了多种预编译版本，根据你的显卡类型选择合适的版本：

系统类型	GPU 类型	包名称字段
macOS	苹果芯片	macos-arm64.zip
Windows	英特尔 GPU（含 Arc 独显/Xe 核显）	win-sycl-x64.zip
Windows	英伟达 GPU（CUDA 11.7-12.3）	win-cuda-cu11.7-x64.zip
Windows	英伟达 GPU（CUDA 12.4+）	win-cuda-cu12.4-x64.zip
Windows	AMD 和其他 GPU（含 AMD 核显）	win-vulkan-x64.zip
Windows	无 GPU	win-openblas-x64.zip

也可以参照 llama.cpp 官方构建文档 (opens in a new tab)，选择适合的方法本地编译构建。

获取 gguf 格式模型

llama.cpp 支持 .gguf 格式的模型，但 RWKV 官方仅发布了 .pth 格式模型。因此，我们需要使用以下两种方法之一获取 .gguf 格式的 RWKV 模型。

可以从 RWKV-6-World-GGUF 仓库 (opens in a new tab) 下载已量化并转化成 gguf 格式的 RWKV 模型。

ℹ️

请在 llama.cpp 目录下新建一个 models 文件夹，将下载的 gguf 模型放入 models 文件夹中。

⚠️

RWKV gguf 模型有多种量化类型，精度越高模型的回复效果越好，但模型体积和计算要求越高。

推荐顺序：FP16 > Q8_0 > Q5_K_M > Q4_K_M ，更低的量化精度（如 Q3_0、Q2_0 等）可能会大大降低模型的性能，甚至无法正常回答问题。

运行 RWKV 模型进行对话

在 llama.cpp 目录运行以下命令，可以开启 llama.cpp 的命令行对话模式：

./llama-cli -m models/rwkv-6-world-7b-Q8_0.gguf -p "You are a helpful assistant" -cnv -t 8 -ngl 99 -n 500

这条命令通过 llama-cli 运行 models/rwkv-6-world-7b-Q8_0.gguf 模型，使用 8 个线程、并根据给定的初始 prompt You are a helpful assistant 开启对话，每次生成最多 500 个 token。

llama.cpp-RWKV-inference-cnv-mode

参数解释：

./llama-cli：启动编译好的 llama-cli 程序
-m models/rwkv-6-world-7b-Q8_0.gguf：模型的路径参数
-p "You are a helpful assistant"：初始 prompt 参数，模型根据该提示词开启对话
-cnv：开启对话模式，此参数为默认参数，可以省略
-t 8：指定线程数，建议根据可用的物理 CPU 核心数调整
-ngl：指定使用 GPU 加载的模型层数，可以通过设定 -ngl 99，使用 GPU 加载 RWKV 模型所有层
-n 500：-n 参数表示模型生成的最大 token 数

ℹ️

完整的参数列表可以在 llama.cpp 参数文档 (opens in a new tab)中查看。

附加功能（可选）

启用续写模式

ℹ️

./llama-cli 默认是对话模式。可通过添加 -no-cnv 参数设置为续写模式，根据给定的 prompt 继续生成文本。

./llama-cli -m models/rwkv-6-world-7b-Q8_0.gguf -p "User: What's mbti?tell me in chinese.\n\nAssistant:" -no-cnv -t 8 -ngl 99 -n 500

RWKV 模型续写推理

-p "User: What's mbti?tell me in chinese.\n\nAssistant:"： prompt 参数，模型根据该提示词进行续写。更多 RWKV prompt 格式请在 RWKV 文档-提示词指南 (opens in a new tab)中查看
-no-cnv 参数：关闭对话模式，设置 llama.cpp 为续写模式，模型会根据给定的 prompt 继续生成文本
其他参数：与对话模式相同

启动 Web 服务（推荐）

使用以下命令，启动 llama.cpp 的 Web 服务：

./llama-server -m models/rwkv-6-world-7b-Q8_0.gguf -ngl 99

启动后，可以通过 http://127.0.0.1:8080 访问 llama.cpp Web 页面：

WebUI

量化 gguf 模型

在 llama.cpp 目录运行 ./llama-quantize [input_model] [output_model] [quantization_type] 命令，可以对 fp32 和 fp16 的 .gguf 模型进行量化，例如：

./llama-quantize models/rwkv-6-world-1.6b-F16.gguf models/rwkv-6-world-1b6-Q8_0.gguf Q8_0

⚠️

输入模型的精度限制为 fp32 和 fp16，推荐使用 Q5_1、 Q8_0 两种量化精度。

使用 ./llama-quantize --help 命令，查看所有可选的量化精度：

可选的量化精度

附录代码

`.pth` 模型转 HF 格式

以下代码可将 .pth 模型转化为 HF 格式模型（.bin）：

# Convert the model for the pytoch_model.bin
import torch
 
SOURCE_MODEL="./v6-FinchX-14B-pth/rwkv-14b-final.pth"
TARGET_MODEL="./v6-Finch-14B-HF/pytorch_model.bin"
 
# delete target model
import os
if os.path.exists(TARGET_MODEL):
    os.remove(TARGET_MODEL)
 
model = torch.load(SOURCE_MODEL, mmap=True, map_location='cpu')
# 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
# Rename all the keys, to include "rwkv."
new_model = {}
for key in model.keys():
 
    # If the keys start with "blocks"
    if key.startswith("blocks."):
        new_key = "rwkv." + key
        # Replace .att. with .attention.
        new_key = new_key.replace(".att.", ".attention.")
        # Replace .ffn. with .feed_forward.
        new_key = new_key.replace(".ffn.", ".feed_forward.")
        # Replace `0.ln0.` with `0.pre_ln.`
        new_key = new_key.replace("0.ln0.", "0.pre_ln.")
    else:
        # No rename needed
        new_key = key
 
        # Rename `emb.weight` to `rwkv.embeddings.weight`
        if key == "emb.weight":
            new_key = "rwkv.embeddings.weight"
 
        # Rename the `ln_out.x` to `rwkv.ln_out.x
        if key.startswith("ln_out."):
            new_key = "rwkv." + key
 
    print("Renaming key:", key, "--to-->", new_key)
    new_model[new_key] = model[key]
 
# Save the new model
print("Saving the new model to:", TARGET_MODEL)
torch.save(new_model, TARGET_MODEL)

KoboldCPP 推理教程