RWKV Benchmark Data

Uncheatable Eval Benchmark

Uncheatable Eval is an "uncheatable evaluation": it uses live data, such as the latest papers and news articles, to assess the real language-modeling and generalization ability of open-source large language models.

Uncheatable Eval reports a compression rate, so a lower score means better model performance.
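
Conceptually, a language model's compression rate is tied to its cross-entropy on the test text: the better the model predicts the next token, the fewer bits it needs per byte of original text. The exact recipe is defined by the Uncheatable Eval project; the snippet below is only a minimal sketch of the idea using Hugging Face transformers, and the model name in the commented example is just a placeholder.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compression_rate(model_name: str, text: str) -> float:
    """Rough sketch: score a text by how well a causal LM compresses it.

    Returns (bits the model needs) / (8 * UTF-8 bytes) * 100, a percentage
    where lower means better modeling. This mirrors the "lower is better"
    idea of Uncheatable Eval, not its exact scoring code.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # `loss` is the mean cross-entropy (in nats) over the predicted tokens.
        loss = model(ids, labels=ids).loss.item()

    n_predicted = ids.shape[1] - 1          # the first token has no prediction target
    total_bits = loss * n_predicted / math.log(2)
    original_bits = 8 * len(text.encode("utf-8"))
    return 100.0 * total_bits / original_bits

# Hypothetical usage (any causal LM checkpoint would do):
# print(compression_rate("RWKV/rwkv-4-169m-pile", "RWKV is an RNN that ..."))
```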

Below is a comparison of Uncheatable Eval scores for RWKV and other models:

14B Parameter Models

| Model | Average | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
|---|---|---|---|---|---|---|---|---|
| Qwen3-14B-Base | 6.85 | 10.57 | 8.45 | 7.94 | 7.00 | 7.21 | 3.44 | 3.31 |
| rwkv7-g0b-13.3b-20251114 | 6.87 | 9.85 | 8.20 | 7.64 | 7.11 | 7.38 | 4.03 | 3.89 |
| gemma-3-12b-pt | 6.95 | 10.54 | 7.91 | 7.61 | 7.29 | 7.39 | 3.88 | 4.00 |
| Qwen2.5-14B | 6.95 | 10.56 | 8.32 | 7.94 | 7.22 | 7.39 | 3.63 | 3.60 |
| Mistral-Nemo-Base-2407 | 6.97 | 10.16 | 8.12 | 7.64 | 7.29 | 7.46 | 4.08 | 4.04 |
| Motif-2-12.7B-Base | 7.10 | 10.63 | 8.33 | 7.90 | 7.13 | 7.40 | 4.19 | 4.11 |
| Llama-2-13b-hf | 7.54 | 10.66 | 8.31 | 7.90 | 7.99 | 8.12 | 4.80 | 5.01 |

To make each model's performance on every benchmark easier to compare at a glance, we also normalized the raw data.

| Name | Params (B) | Average (lower=better) | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-14B-Base | 14.768 | 6.845 | 10.569 | 8.445 | 7.942 | 7.001 | 7.210 | 3.439 | 3.312 |
| rwkv7-g0b-13.3b | 13.269 | 6.870 | 9.848 | 8.202 | 7.636 | 7.108 | 7.380 | 4.026 | 3.892 |
| gemma-3-12b-pt | 12.187 | 6.945 | 10.540 | 7.914 | 7.607 | 7.286 | 7.387 | 3.883 | 3.997 |
| Qwen2.5-14B | 14.770 | 6.951 | 10.558 | 8.317 | 7.944 | 7.224 | 7.392 | 3.625 | 3.599 |
| Mistral-Nemo-Base-2407 | 12.248 | 6.970 | 10.165 | 8.118 | 7.642 | 7.287 | 7.455 | 4.079 | 4.042 |
| Motif-2-12.7B-Base | 12.704 | 7.099 | 10.628 | 8.328 | 7.897 | 7.134 | 7.404 | 4.189 | 4.114 |
| Llama-2-13b-hf | 13.016 | 7.540 | 10.655 | 8.307 | 7.901 | 7.993 | 8.122 | 4.795 | 5.009 |
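
The normalization mentioned above puts benchmarks with different score ranges on a common scale. The page does not spell out the exact scheme; per-benchmark min-max scaling is one common choice, sketched below with a few values taken from the table above.

```python
def minmax_normalize(rows: dict[str, list[float]]) -> dict[str, list[float]]:
    """Per-column min-max scaling: for each benchmark, map the best (lowest)
    score to 0.0 and the worst (highest) to 1.0.

    This is an assumed scheme for illustration, not necessarily the one used
    for the charts on the original page.
    """
    columns = list(zip(*rows.values()))           # one tuple per benchmark
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return {
        name: [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(scores, lows, highs)
        ]
        for name, scores in rows.items()
    }

# Two-benchmark example (ao3 english, github python) from the 14B table:
scores = {
    "Qwen3-14B-Base": [10.57, 3.31],
    "rwkv7-g0b-13.3b": [9.85, 3.89],
    "Llama-2-13b-hf": [10.66, 5.01],
}
print(minmax_normalize(scores))
```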

7B Parameter Models

| Model | Average | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
|---|---|---|---|---|---|---|---|---|
| Qwen3-8B-Base | 7.09 | 10.89 | 8.72 | 8.26 | 7.21 | 7.47 | 3.62 | 3.48 |
| Meta-Llama-3-8B | 7.16 | 10.62 | 8.30 | 7.79 | 7.54 | 7.54 | 4.17 | 4.18 |
| RWKV7-g0a3-7.2b-20251029 | 7.22 | 10.16 | 8.48 | 8.00 | 7.44 | 7.75 | 4.38 | 4.35 |
| Qwen2.5-7B | 7.32 | 11.08 | 8.73 | 8.45 | 7.54 | 7.79 | 3.87 | 3.81 |
| Falcon-H1-7B-Base | 7.34 | 10.96 | 8.58 | 8.23 | 7.40 | 7.57 | 4.25 | 4.39 |
| Mistral-7B-v0.1 | 7.41 | 10.66 | 8.31 | 7.98 | 7.75 | 7.90 | 4.61 | 4.64 |
| Hunyuan-7B-Pretrain | 7.54 | 11.51 | 8.99 | 8.50 | 7.65 | 8.11 | 4.20 | 3.83 |
| falcon-mamba-7b | 7.55 | 10.76 | 8.96 | 8.59 | 7.67 | 7.74 | 4.44 | 4.68 |
| Zamba2-7B | 7.58 | 10.70 | 8.63 | 8.07 | 7.84 | 8.12 | 4.83 | 4.87 |
| Minitron-8B-Base | 7.58 | 10.84 | 8.65 | 8.28 | 7.86 | 8.23 | 4.51 | 4.71 |
| Olmo-3-1025-7B | 7.60 | 11.10 | 8.78 | 8.52 | 7.49 | 7.95 | 4.93 | 4.39 |
| RWKV-x060-World-7B-v3-20241112 | 7.63 | 10.63 | 8.75 | 8.29 | 7.94 | 8.11 | 4.79 | 4.93 |

To make each model's performance on every benchmark easier to compare at a glance, we selected the seven models with the best average scores and normalized the raw data.

| Name | Params (B) | Average (lower=better) | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B-Base | 8.191 | 7.091 | 10.890 | 8.718 | 8.255 | 7.207 | 7.465 | 3.617 | 3.482 |
| Meta-Llama-3-8B | 8.030 | 7.162 | 10.619 | 8.295 | 7.785 | 7.536 | 7.541 | 4.174 | 4.181 |
| RWKV7-g0a3-7.2b-20251029-ctx8192 | 7.199 | 7.222 | 10.164 | 8.480 | 7.996 | 7.440 | 7.747 | 4.378 | 4.347 |
| Qwen2.5-7B | 7.616 | 7.323 | 11.079 | 8.729 | 8.449 | 7.539 | 7.792 | 3.868 | 3.806 |
| Falcon-H1-7B-Base | 7.586 | 7.339 | 10.958 | 8.576 | 8.225 | 7.403 | 7.569 | 4.251 | 4.392 |
| Mistral-7B-v0.1 | 7.242 | 7.406 | 10.662 | 8.306 | 7.976 | 7.745 | 7.903 | 4.612 | 4.635 |
| Hunyuan-7B-Pretrain | 7.505 | 7.541 | 11.509 | 8.987 | 8.499 | 7.653 | 8.108 | 4.201 | 3.829 |
| falcon-mamba-7b | 7.273 | 7.548 | 10.760 | 8.958 | 8.589 | 7.674 | 7.737 | 4.437 | 4.680 |
| Zamba2-7B | 7.357 | 7.582 | 10.702 | 8.627 | 8.074 | 7.843 | 8.124 | 4.833 | 4.869 |
| Minitron-8B-Base | 8.272 | 7.582 | 10.835 | 8.654 | 8.284 | 7.856 | 8.230 | 4.508 | 4.708 |
| Olmo-3-1025-7B | 7.298 | 7.595 | 11.101 | 8.784 | 8.522 | 7.490 | 7.947 | 4.930 | 4.394 |
| RWKV-x060-World-7B-v3-20241112-ctx4096 | 7.636 | 7.633 | 10.629 | 8.753 | 8.288 | 7.936 | 8.109 | 4.786 | 4.929 |

3B Parameter Models

| Model | Average | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
|---|---|---|---|---|---|---|---|---|
| RWKV7-g1a4-2.9b-20251118 | 7.49 | 10.48 | 8.80 | 8.31 | 7.71 | 8.07 | 4.55 | 4.47 |
| Llama-3.2-3B | 7.64 | 11.22 | 8.70 | 8.37 | 7.93 | 8.07 | 4.66 | 4.56 |
| Qwen2.5-3B | 7.72 | 11.58 | 9.14 | 8.90 | 7.91 | 8.22 | 4.20 | 4.11 |
| SmolLM3-3B-Base | 7.78 | 11.19 | 8.90 | 8.61 | 8.10 | 8.63 | 4.51 | 4.55 |
| RWKV-x070-World-2.9B-v3-20250211 | 7.80 | 10.81 | 8.91 | 8.50 | 8.05 | 8.31 | 4.96 | 5.07 |
| stablelm-3b-4e1t | 7.91 | 11.21 | 8.82 | 8.43 | 8.30 | 8.48 | 4.91 | 5.21 |
| Falcon-H1-3B-Base | 7.94 | 11.69 | 9.16 | 8.91 | 7.89 | 8.16 | 4.83 | 4.92 |
| recurrentgemma-2b | 8.05 | 11.63 | 8.95 | 8.84 | 8.40 | 8.49 | 4.90 | 5.16 |
| RWKV-x060-World-3B-v2.1-20240417 | 8.15 | 11.01 | 9.16 | 8.82 | 8.45 | 8.56 | 5.48 | 5.56 |
| mamba2attn-2.7b | 8.20 | 11.44 | 9.25 | 8.95 | 8.47 | 8.24 | 5.34 | 5.75 |

To make each model's performance on every benchmark easier to compare at a glance, we selected the seven models with the best average scores and normalized the raw data.

| Name | Params (B) | Average (lower=better) | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
|---|---|---|---|---|---|---|---|---|---|
| RWKV7-g1a4-2.9b-20251118-ctx8192 | 2.948 | 7.486 | 10.481 | 8.800 | 8.310 | 7.712 | 8.072 | 4.553 | 4.474 |
| Llama-3.2-3B | 3.213 | 7.643 | 11.219 | 8.701 | 8.365 | 7.928 | 8.065 | 4.661 | 4.562 |
| Qwen2.5-3B | 3.086 | 7.722 | 11.575 | 9.139 | 8.895 | 7.911 | 8.220 | 4.203 | 4.113 |
| SmolLM3-3B-Base | 3.075 | 7.784 | 11.187 | 8.905 | 8.611 | 8.097 | 8.631 | 4.513 | 4.546 |
| RWKV-x070-World-2.9B-v3-20250211-ctx4096 | 2.948 | 7.800 | 10.812 | 8.909 | 8.501 | 8.049 | 8.307 | 4.955 | 5.066 |
| stablelm-3b-4e1t | 2.795 | 7.907 | 11.211 | 8.815 | 8.434 | 8.299 | 8.476 | 4.906 | 5.207 |
| Falcon-H1-3B-Base | 3.149 | 7.936 | 11.685 | 9.158 | 8.910 | 7.891 | 8.161 | 4.832 | 4.917 |
| recurrentgemma-2b | 2.683 | 8.052 | 11.632 | 8.951 | 8.835 | 8.401 | 8.488 | 4.897 | 5.157 |
| RWKV-x060-World-3B-v2.1-20240417-ctx4096 | 3.100 | 8.147 | 11.005 | 9.161 | 8.815 | 8.451 | 8.559 | 5.479 | 5.561 |
| mamba2attn-2.7b | 2.698 | 8.204 | 11.436 | 9.246 | 8.947 | 8.474 | 8.236 | 5.336 | 5.751 |

1.5B Parameter Models

| Model | Average | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | 7.97 | 12.02 | 9.74 | 9.35 | 7.94 | 8.35 | 4.26 | 4.10 |
| rwkv7-g1b-1.5b-20251015 | 7.97 | 10.97 | 9.25 | 8.84 | 8.11 | 8.54 | 5.04 | 5.03 |
| Qwen2.5-1.5B | 8.12 | 12.11 | 9.56 | 9.39 | 8.27 | 8.65 | 4.50 | 4.38 |
| RWKV-x070-World-1.5B-v3-20250127 | 8.23 | 11.27 | 9.32 | 8.97 | 8.43 | 8.76 | 5.39 | 5.48 |
| SmolLM2-1.7B | 8.30 | 11.54 | 9.37 | 9.35 | 8.55 | 9.05 | 5.08 | 5.15 |
| Llama-3.2-1B | 8.31 | 12.04 | 9.33 | 9.10 | 8.56 | 8.76 | 5.27 | 5.10 |
| Index-1.9B | 8.34 | 11.83 | 9.49 | 9.07 | 8.50 | 8.56 | 5.38 | 5.55 |
| stablelm-2-1_6b | 8.40 | 11.76 | 9.24 | 8.94 | 8.76 | 9.09 | 5.56 | 5.43 |
| Falcon-H1-1.5B-Deep-Base | 8.51 | 12.14 | 9.67 | 9.48 | 8.41 | 8.97 | 5.50 | 5.37 |
| RWKV-x060-World-1B6-v2.1-20240328 | 8.56 | 11.43 | 9.56 | 9.28 | 8.82 | 8.99 | 5.91 | 5.97 |
| Falcon-H1-1.5B-Base | 8.64 | 12.29 | 9.80 | 9.65 | 8.51 | 9.09 | 5.64 | 5.51 |
| mamba2-1.3b | 8.70 | 11.94 | 9.71 | 9.46 | 8.93 | 8.71 | 5.85 | 6.29 |
| RWKV-5-World-1B5-v2-20231025 | 8.72 | 11.60 | 9.73 | 9.45 | 8.98 | 9.10 | 6.04 | 6.11 |
| mamba-1.4b-hf | 8.81 | 12.03 | 9.78 | 9.55 | 9.08 | 8.84 | 5.96 | 6.41 |

To make each model's performance on every benchmark easier to compare at a glance, we selected the seven models with the best average scores and normalized the raw data.

| Name | Params (B) | Average (lower=better) | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | 1.721 | 7.965 | 12.016 | 9.743 | 9.352 | 7.936 | 8.350 | 4.260 | 4.095 |
| rwkv7-g1b-1.5b-20251015-ctx8192 | 1.527 | 7.969 | 10.972 | 9.250 | 8.843 | 8.110 | 8.537 | 5.041 | 5.027 |
| Qwen2.5-1.5B | 1.544 | 8.124 | 12.114 | 9.562 | 9.393 | 8.270 | 8.646 | 4.502 | 4.384 |
| RWKV-x070-World-1.5B-v3-20250127-ctx4096 | 1.527 | 8.231 | 11.273 | 9.320 | 8.965 | 8.431 | 8.758 | 5.385 | 5.483 |
| SmolLM2-1.7B | 1.711 | 8.298 | 11.536 | 9.373 | 9.351 | 8.547 | 9.047 | 5.080 | 5.152 |
| Llama-3.2-1B | 1.236 | 8.306 | 12.036 | 9.331 | 9.097 | 8.556 | 8.755 | 5.267 | 5.101 |
| Index-1.9B | 2.173 | 8.340 | 11.831 | 9.493 | 9.069 | 8.497 | 8.561 | 5.380 | 5.547 |
| stablelm-2-1_6b | 1.645 | 8.396 | 11.761 | 9.237 | 8.943 | 8.762 | 9.088 | 5.558 | 5.425 |
| Falcon-H1-1.5B-Deep-Base | 1.555 | 8.505 | 12.144 | 9.666 | 9.482 | 8.407 | 8.968 | 5.497 | 5.368 |
| RWKV-x060-World-1B6-v2.1-20240328-ctx4096 | 1.600 | 8.564 | 11.434 | 9.555 | 9.276 | 8.822 | 8.990 | 5.906 | 5.968 |
| Falcon-H1-1.5B-Base | 1.555 | 8.639 | 12.287 | 9.796 | 9.645 | 8.507 | 9.089 | 5.635 | 5.514 |
| mamba2-1.3b | 1.344 | 8.699 | 11.944 | 9.710 | 9.463 | 8.925 | 8.714 | 5.851 | 6.286 |
| RWKV-5-World-1B5-v2-20231025-ctx4096 | 1.578 | 8.715 | 11.595 | 9.731 | 9.451 | 8.977 | 9.103 | 6.039 | 6.110 |
| mamba-1.4b-hf | 1.372 | 8.806 | 12.026 | 9.783 | 9.552 | 9.081 | 8.836 | 5.958 | 6.408 |

MMLU Benchmark

MMLU (Massive Multitask Language Understanding) is a benchmark that evaluates a model's multi-task language understanding. It covers 57 subjects ranging from middle-school to graduate level, including mathematics, physics, history, law, and biology, and tests whether a language model can reason, answer questions, and demonstrate cross-disciplinary knowledge across different domains.

| Model | MMLU | MMLU COT |
|---|---|---|
| rwkv7-g0b-13.3b | 0.765 | 0.827 |
| rwkv7-g0b-7.2b | 0.660 | 0.731 |
| rwkv7-g1b-2.9b | 0.622 | 0.669 |
| rwkv7-g1b-1.5b | 0.505 | 0.542 |
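
MMLU items are four-option multiple-choice questions, so accuracy is simply the fraction of questions where the model commits to the correct letter. The loop below is an illustrative sketch of that scoring, not the harness behind these numbers; `ask_model` is a placeholder for whatever inference call is available.

```python
import re

def score_multiple_choice(items, ask_model):
    """Fraction of items where the model's chosen letter matches the key.

    `items`    : list of dicts with "question", "choices" (4 strings), "answer" (letter)
    `ask_model`: callable(prompt) -> model text; a stand-in for real inference.
    """
    correct = 0
    for item in items:
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {choice}"
            for letter, choice in zip("ABCD", item["choices"])
        ) + "\nAnswer:"
        reply = ask_model(prompt)
        match = re.search(r"\b([ABCD])\b", reply)   # first letter the model commits to
        if match and match.group(1) == item["answer"]:
            correct += 1
    return correct / len(items)
```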

MMLU-Pro Benchmark

MMLU-Pro is a more robust and more challenging large-scale multi-task understanding dataset that evaluates large language models more rigorously. It contains 12,000 complex questions spanning a wide range of subjects.

| Model | MMLU-PRO | MMLU-PRO COT |
|---|---|---|
| rwkv7-g0b-13.3b | 0.502 | 0.612 |
| rwkv7-g0b-7.2b | 0.361 | 0.526 |
| rwkv7-g1b-2.9b | 0.322 | 0.433 |
| rwkv7-g1b-1.5b | 0.222 | 0.292 |

MMLU Variant Benchmarks

MMLU Redux is a trimmed and corrected version of MMLU with some mislabeled items removed; MMMLU (Multilingual MMLU) is the multilingual version of the benchmark.

| Model | MMLU Redux | Redux COT | MMMLU | MMMLU COT |
|---|---|---|---|---|
| rwkv7-g0b-13.3b | 0.787 | 0.863 | 0.765 | 0.827 |
| rwkv7-g0b-7.2b | 0.685 | 0.772 | 0.660 | 0.731 |
| rwkv7-g1b-2.9b | 0.641 | 0.699 | 0.622 | 0.669 |
| rwkv7-g1b-1.5b | 0.529 | 0.569 | 0.505 | 0.541 |

GSM8K Benchmark

GSM8K (Grade School Math 8K) is a dataset of 8,500 high-quality, linguistically diverse grade-school math word problems, used to evaluate a model's mathematical reasoning ability.

| Model | GSM8K COT |
|---|---|
| rwkv7-g0b-13.3b | 0.923 |
| rwkv7-g0b-7.2b | 0.851 |
| rwkv7-g1b-2.9b | 0.766 |
| rwkv7-g1b-1.5b | 0.585 |
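
GSM8K answers are typically graded by extracting the final number from the model's chain-of-thought and comparing it with the reference answer, which the dataset marks after "####". A minimal sketch of that check, assuming the model output is already available as a string:

```python
import re

def extract_last_number(text: str):
    """Pull the final number out of a chain-of-thought answer."""
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

def gsm8k_correct(model_output: str, reference_solution: str) -> bool:
    """Simplified grading rule: the prediction is the last number the model
    writes; the gold answer follows '####' in the reference solution."""
    gold = reference_solution.split("####")[-1].strip().replace(",", "")
    pred = extract_last_number(model_output)
    return pred is not None and float(pred) == float(gold)

# Hypothetical example:
print(gsm8k_correct("She buys 3 * 4 = 12 eggs, so the answer is 12.",
                    "3 * 4 = 12\n#### 12"))   # -> True
```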

MATH-500 Benchmark

MATH-500 is a well-established benchmark for measuring an AI model's mathematical reasoning ability. It contains 500 challenging math problems covering algebra, geometry, calculus, probability and statistics, and other areas.

| Model | MATH500 COT |
|---|---|
| rwkv7-g0b-13.3b | 0.768 |
| rwkv7-g0b-7.2b | 0.635 |
| rwkv7-g1b-2.9b | 0.495 |
| rwkv7-g1b-1.5b | 0.298 |

General Math and Reasoning Benchmarks

This group includes the Hendrycks Math, SVAMP, ASDiv, and MAWPS datasets, as well as Algebra 222 and Math Odyssey. It covers many types of math problems, from basic arithmetic and algebra to geometry, with an emphasis on the model's chain-of-thought (CoT) ability.

| Model | Hendrycks Math | SVAMP | ASDiv | MAWPS |
|---|---|---|---|---|
| rwkv7-g0b-13.3b | 0.558 | 0.942 | 0.923 | 0.949 |
| rwkv7-g0b-7.2b | 0.488 | 0.926 | 0.905 | 0.924 |
| rwkv7-g1b-2.9b | 0.383 | 0.840 | 0.844 | 0.890 |
| rwkv7-g1b-1.5b | 0.246 | 0.683 | 0.735 | 0.813 |

| Model | GSM+ | Algebra 222 | Math Odyssey |
|---|---|---|---|
| rwkv7-g0b-13.3b | 0.767 | 0.892 | 0.494 |
| rwkv7-g0b-7.2b | 0.685 | 0.856 | 0.401 |
| rwkv7-g1b-2.9b | 0.575 | 0.795 | 0.320 |
| rwkv7-g1b-1.5b | 0.406 | 0.658 | 0.202 |

Code Generation Benchmarks

This group includes HumanEval and MBPP (Mostly Basic Python Problems) and their extended variants. These tests evaluate a model's ability to turn natural-language descriptions into executable code, from basic programming logic to complex algorithm implementations.

| Model | HumanEval | HumanEval+ | MBPP | MBPP+ |
|---|---|---|---|---|
| rwkv7-g0b-13.3b | 0.817 | 0.762 | 0.820 | 0.706 |
| rwkv7-g0b-7.2b | 0.640 | 0.604 | 0.757 | 0.640 |
| rwkv7-g1b-2.9b | 0.537 | 0.494 | 0.627 | 0.550 |
| rwkv7-g1b-1.5b | 0.396 | 0.348 | 0.439 | 0.368 |

Supplementary HumanEval Fix / CN data:

  • rwkv7-g0b-13.3b: Fix 0.823 / CN 0.799
  • rwkv7-g0b-7.2b: Fix 0.585 / CN 0.659
  • rwkv7-g1b-2.9b: Fix 0.524 / CN 0.524
  • rwkv7-g1b-1.5b: Fix 0.323 / CN 0.390
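
HumanEval and MBPP scores are pass rates: a completion only counts if the assembled program passes the benchmark's unit tests. The sketch below shows the idea of that check with a subprocess and a timeout; it is not the harness behind the numbers above, and real evaluation code sandboxes execution properly.

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(prompt: str, completion: str, test_code: str,
                 timeout: float = 10.0) -> bool:
    """Return True if prompt + completion + unit tests run without error.

    WARNING: this executes untrusted model output; real harnesses sandbox it.
    """
    program = prompt + completion + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def pass_at_1(samples) -> float:
    """`samples`: iterable of (prompt, completion, test_code) triples -> pass rate."""
    samples = list(samples)
    return sum(passes_tests(*s) for s in samples) / len(samples)
```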

Chinese Language Benchmarks

C-Eval and CMMLU are comprehensive capability benchmarks for Chinese large language models. They cover humanities, social sciences, STEM, and other disciplines, and assess a model's knowledge and reasoning in a Chinese-language context.

| Model | C-Eval | C-Eval COT | CMMLU | CMMLU COT |
|---|---|---|---|---|
| rwkv7-g0b-13.3b | 0.640 | 0.674 | 0.667 | 0.689 |
| rwkv7-g0b-7.2b | 0.540 | 0.563 | 0.564 | 0.598 |
| rwkv7-g1b-2.9b | 0.496 | 0.513 | 0.523 | 0.556 |
| rwkv7-g1b-1.5b | 0.427 | 0.426 | 0.422 | 0.442 |

Gaokao 2023 English COT (English section of the Chinese college entrance exam): 13.3b: 0.665 | 7.2b: 0.535 | 2.9b: 0.447 | 1.5b: 0.273

IFEval

IFEval (Instruction-Following Evaluation) is a benchmark dataset designed specifically to evaluate a large model's instruction-following ability. Its core feature is the use of verifiable instructions: deterministic rules (such as regex matching, counting, and format checks) automatically determine whether the model followed each instruction.

| Model | IFEval (strict prompt-level) |
|---|---|
| rwkv7-g0b-13.3b | 0.689 |
| rwkv7-g0b-7.2b | 0.579 |
| rwkv7-g1b-2.9b | 0.494 |
| rwkv7-g1b-1.5b | 0.421 |
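
The deterministic checks described above can be written as small rule functions over the model's response. The snippet below sketches two such verifiable instructions (keyword inclusion and a minimum bullet count) and a strict prompt-level aggregation; the official IFEval suite defines many more instruction types and the exact strict/loose grading.

```python
import re

def contains_keyword(response: str, keyword: str) -> bool:
    """Verifiable instruction: 'include the word <keyword> in your answer'."""
    return keyword.lower() in response.lower()

def has_min_bullets(response: str, n: int) -> bool:
    """Verifiable instruction: 'answer with at least <n> bullet points'."""
    bullets = re.findall(r"^\s*[-*]\s+", response, flags=re.MULTILINE)
    return len(bullets) >= n

def prompt_level_strict(response: str, checks) -> bool:
    """Strict prompt-level scoring: every instruction attached to the prompt
    must be satisfied for the prompt to count as followed."""
    return all(check(response) for check in checks)

# Hypothetical example:
reply = "- RWKV is an RNN\n- it trains like a Transformer"
checks = [lambda r: contains_keyword(r, "RWKV"), lambda r: has_min_bullets(r, 2)]
print(prompt_level_strict(reply, checks))   # -> True
```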

GPQA Benchmark

GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a highly challenging question-answering dataset containing graduate-level questions in biology, physics, and chemistry. The questions are designed to be "Google-proof": a direct answer is hard to find with a simple search-engine query.

| Model | GPQA Main | GPQA Main COT | SuperGPQA |
|---|---|---|---|
| rwkv7-g0b-13.3b | 0.379 | 0.429 | 0.288 |
| rwkv7-g0b-7.2b | 0.308 | 0.317 | 0.220 |
| rwkv7-g1b-2.9b | 0.333 | 0.326 | 0.194 |
| rwkv7-g1b-1.5b | 0.283 | 0.279 | 0.158 |