Uncheatable Eval Benchmark
ℹ️ Uncheatable Eval is, as the name says, an evaluation that cannot be gamed: it uses live data such as newly published papers and news articles to measure the real modeling and generalization ability of open-source large language models.
⚠️ Uncheatable Eval reports compression-rate scores, so a lower score means better model performance.
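To make the metric concrete, the sketch below (a minimal illustration, not the official Uncheatable Eval code) computes a compression-rate-style score as the average cross-entropy a causal language model assigns to a freshly published document. The model id is a placeholder, and the leaderboard's exact tokenizer handling and normalization may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your/causal-lm"  # placeholder; substitute any HF causal LM id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def compression_score(text: str) -> float:
    """Average cross-entropy per token; lower = the model 'compresses' the text better."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels supplied, a HF causal LM returns the mean cross-entropy
        # over all next-token predictions.
        loss = model(input_ids=ids, labels=ids).loss
    return loss.item()

# Score text the model cannot have memorized, e.g. an arXiv paper or a news
# article published after its training data was collected.
print(compression_score("<paste a freshly published article here>"))
```

Because the test data did not exist at training time, a model cannot do well by memorization; it has to genuinely generalize.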
Below is a comparison of Uncheatable Eval scores between RWKV and other models:
14B parameter models
| Model | Average | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mistral-Nemo-Base-2407 | 7.11 | 10.07 | 8.08 | 7.95 | 7.42 | 7.66 | 4.20 | 4.37 |
| RWKV-6-14B-v2.1 | 7.61 | 10.19 | 8.52 | 8.34 | 7.92 | 8.04 | 4.93 | 5.33 |
| Llama-2-13b-hf | 7.68 | 10.52 | 8.28 | 8.19 | 8.07 | 8.31 | 4.93 | 5.43 |
| Qwen1.5-14B | 7.70 | 10.88 | 8.88 | 9.10 | 7.75 | 7.86 | 4.67 | 4.74 |
| pythia-12b-v0 | 8.36 | 11.29 | 9.19 | 9.53 | 8.54 | 8.40 | 5.43 | 6.13 |
7B parameter models
| Model | Average | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Meta-Llama-3.1-8B | 7.23 | 10.53 | 8.20 | 7.93 | 7.55 | 7.69 | 4.20 | 4.49 |
| Qwen2.5-7B | 7.45 | 10.93 | 8.74 | 8.81 | 7.68 | 7.99 | 3.92 | 4.09 |
| Qwen2-7B | 7.53 | 10.81 | 8.58 | 8.71 | 7.82 | 8.25 | 4.20 | 4.32 |
| Mistral-7B-v0.1 | 7.58 | 10.55 | 8.28 | 8.44 | 7.85 | 8.09 | 4.80 | 5.05 |
| RWKV-6-World-7B-v2.1 | 7.82 | 10.41 | 8.74 | 8.58 | 8.11 | 8.25 | 5.12 | 5.52 |
| Yi-1.5-6B | 7.83 | 10.93 | 8.79 | 8.95 | 8.10 | 8.41 | 4.75 | 4.89 |
| OLMo-1.7-7B-hf | 7.88 | 11.01 | 8.65 | 8.99 | 8.00 | 8.20 | 4.97 | 5.35 |
| RWKV-5-World-7B-v2 | 7.91 | 10.49 | 8.88 | 8.69 | 8.18 | 8.31 | 5.19 | 5.59 |
| Qwen1.5-7B | 7.92 | 11.10 | 9.13 | 9.36 | 7.95 | 8.11 | 4.84 | 4.92 |
| mpt-7B | 7.95 | 11.19 | 8.68 | 8.77 | 8.16 | 8.44 | 4.95 | 5.47 |
| Llama-2-7B-hf | 7.97 | 10.84 | 8.51 | 8.52 | 8.33 | 8.63 | 5.24 | 5.70 |
| Zamba-7B-v1 | 8.09 | 10.85 | 8.52 | 8.64 | 8.06 | 8.29 | 5.94 | 6.33 |
| open_llama_7B_v2 | 8.10 | 11.09 | 8.84 | 9.05 | 8.40 | 8.76 | 4.89 | 5.70 |
| falcon-7B | 8.30 | 10.76 | 8.69 | 9.15 | 8.55 | 9.06 | 5.76 | 6.15 |
| pythia-6.9b-v0 | 8.54 | 11.49 | 9.38 | 9.76 | 8.68 | 8.57 | 5.61 | 6.32 |
| mamba-7B-rw | 9.78 | 10.81 | 8.55 | 8.99 | 8.61 | 9.14 | 11.11 | 11.27 |
3B parameter models
| Model | Average | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| stablelm-3b-4e1t | 8.03 | 11.05 | 8.79 | 8.83 | 8.36 | 8.64 | 4.97 | 5.58 |
| Minitron-4B-Base | 8.14 | 11.22 | 9.04 | 9.20 | 8.30 | 8.86 | 4.95 | 5.42 |
| recurrentgemma-2b | 8.17 | 11.45 | 8.92 | 9.16 | 8.45 | 8.66 | 5.01 | 5.51 |
| Qwen1.5-4B | 8.25 | 11.61 | 9.34 | 9.78 | 8.23 | 8.45 | 5.14 | 5.20 |
| Llama-3.1-Minitron-4B | 8.26 | 11.40 | 9.26 | 9.40 | 8.46 | 9.07 | 4.91 | 5.30 |
| RWKV-6-World-3B-v2.1 | 8.26 | 10.84 | 9.16 | 9.10 | 8.49 | 8.72 | 5.57 | 5.96 |
| Phi-3-mini-4k-instruct | 8.33 | 11.98 | 9.19 | 9.30 | 8.39 | 9.02 | 5.44 | 5.02 |
| mamba2attn-2.7B | 8.36 | 11.29 | 9.23 | 9.59 | 8.52 | 8.40 | 5.41 | 6.08 |
| gemma-2b | 8.38 | 11.74 | 9.14 | 9.39 | 8.69 | 8.88 | 5.19 | 5.67 |
| RWKV-5-World-3B-v2 | 8.41 | 10.99 | 9.34 | 9.30 | 8.62 | 8.84 | 5.70 | 6.08 |
| open_llama_3b_v2 | 8.46 | 11.47 | 9.15 | 9.47 | 8.74 | 9.16 | 5.21 | 6.02 |
| mamba2-2.7B | 8.47 | 11.38 | 9.32 | 9.72 | 8.62 | 8.49 | 5.52 | 6.24 |
| Phi-3.5-mini-instruct | 8.48 | 12.16 | 9.31 | 9.43 | 8.57 | 9.15 | 5.51 | 5.20 |
| Zamba2-2.7B | 8.57 | 11.17 | 8.93 | 9.14 | 8.42 | 8.88 | 6.78 | 6.67 |
| mamba-2.8b-hf | 8.59 | 11.46 | 9.43 | 9.87 | 8.76 | 8.64 | 5.64 | 6.35 |
| RWKV-4-World-3B-v1 | 8.71 | 11.04 | 9.51 | 9.59 | 9.13 | 9.43 | 5.85 | 6.38 |
| pythia-2.8b-v0 | 8.85 | 11.81 | 9.68 | 10.15 | 8.92 | 8.86 | 5.89 | 6.60 |
| RedPajama-INCITE-Base-3B-v1 | 8.87 | 11.66 | 9.13 | 9.29 | 8.88 | 9.21 | 6.62 | 7.29 |
| phi-2 | 8.91 | 12.28 | 9.28 | 9.58 | 8.81 | 9.86 | 6.77 | 5.79 |
| btlm-3b-8k-base | 8.96 | 11.81 | 9.08 | 9.10 | 8.57 | 8.88 | 7.46 | 7.81 |
| RWKV-4-Pile-3B | 9.02 | 11.79 | 9.76 | 10.40 | 9.20 | 9.06 | 6.08 | 6.85 |
| Sheared-LLaMA-2.7B | 9.10 | 11.58 | 9.15 | 9.61 | 9.11 | 9.65 | 7.07 | 7.54 |
| MiniCPM3-4B | 9.12 | 13.23 | 10.71 | 10.72 | 8.75 | 9.03 | 5.54 | 5.84 |
| mamba-2.8b-slimpj | 9.25 | 13.81 | 9.66 | 9.15 | 8.71 | 8.92 | 7.04 | 7.43 |
| OpenELM-3B | 9.68 | 14.05 | 10.08 | 9.97 | 9.16 | 9.52 | 7.40 | 7.57 |
1.6B parameter models
| Model | Average | ao3 english | bbc news | wikipedia english | arxiv computer science | arxiv physics | github cpp | github python |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-1.5B | 8.46 | 11.78 | 9.44 | 9.88 | 8.75 | 9.32 | 5.02 | 5.01 |
| Index-1.9B | 8.49 | 11.66 | 9.49 | 9.52 | 8.55 | 8.72 | 5.53 | 5.96 |
| stablelm-2-1_6b | 8.53 | 11.56 | 9.21 | 9.38 | 8.79 | 9.25 | 5.65 | 5.88 |
| Rene-v0.1-1.3b-pytorch | 8.56 | 11.62 | 9.20 | 9.84 | 8.65 | 9.09 | 5.54 | 5.97 |
| RWKV-6-World-1B6-v2.1 | 8.68 | 11.24 | 9.55 | 9.60 | 8.84 | 9.14 | 6.00 | 6.36 |
| RWKV-5-World-1B5-v2 | 8.83 | 11.39 | 9.74 | 9.79 | 8.97 | 9.26 | 6.13 | 6.50 |
| mamba2-1.3b | 8.86 | 11.78 | 9.69 | 10.20 | 8.94 | 8.86 | 5.90 | 6.62 |
| mamba-1.4b-hf | 8.97 | 11.85 | 9.77 | 10.32 | 9.08 | 8.99 | 6.01 | 6.75 |
| TinyLlama-1.1B-intermediate-3T | 8.99 | 12.40 | 9.74 | 10.00 | 9.30 | 9.78 | 5.64 | 6.05 |
| Qwen1.5-1.8B | 9.14 | 12.45 | 9.89 | 10.61 | 9.14 | 9.72 | 5.93 | 6.26 |
| RWKV-4-World-1.5B-v1 | 9.19 | 11.46 | 9.94 | 10.11 | 9.57 | 9.96 | 6.39 | 6.87 |
| OLMo-1B-hf | 9.20 | 12.14 | 9.61 | 10.34 | 9.26 | 9.91 | 6.31 | 6.82 |
| Qwen-1_8B | 9.33 | 12.60 | 9.99 | 10.77 | 9.32 | 9.96 | 6.16 | 6.48 |
| pythia-1.4b-v0 | 9.33 | 12.32 | 10.13 | 10.78 | 9.32 | 9.33 | 6.34 | 7.11 |
| RWKV-4-Pile-1B5 | 9.46 | 12.21 | 10.16 | 10.88 | 9.59 | 9.48 | 6.56 | 7.33 |
| h2o-danube-1.8b-base | 9.72 | 11.65 | 9.18 | 9.52 | 9.34 | 10.04 | 9.29 | 9.02 |
| Sheared-LLaMA-1.3B | 9.78 | 12.19 | 9.70 | 10.30 | 9.68 | 10.38 | 7.89 | 8.34 |
| bloom-1b7 | 9.82 | 13.43 | 10.92 | 11.41 | 9.49 | 9.85 | 6.38 | 7.26 |
| OpenELM-1_1B | 10.19 | 14.93 | 10.58 | 10.52 | 9.49 | 10.01 | 7.88 | 7.95 |
| TransNormerLLM-1B | 10.45 | 12.89 | 10.50 | 11.10 | 10.67 | 11.36 | 7.73 | 8.88 |
| phi-1_5 | 10.50 | 13.45 | 11.08 | 13.51 | 9.99 | 11.72 | 7.31 | 6.42 |
| falcon-rw-1b | 12.12 | 12.06 | 9.56 | 10.49 | 9.63 | 10.52 | 16.31 | 16.24 |
MMLU Benchmark
ℹ️ MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating the multi-task language understanding of large language models (LLMs) across a broad range of tasks. It covers 57 subjects, from middle-school to graduate level, including mathematics, physics, history, law, biology, and more, testing whether a model can reason, answer questions, and demonstrate cross-disciplinary knowledge.
Tested with lm_eval's standard prompt format, RWKV-6-World-7B-v2.1 reaches 42.8% MMLU accuracy:
The following are multiple choice questions (with answers) about abstract algebra.
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Answer:
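Mechanically, accuracy in this setting is typically computed by comparing the log-likelihood the model assigns to each answer letter as a continuation of the prompt, counting a question as correct when the most likely letter matches the label. The sketch below illustrates the idea; the helper names are our own and this is not lm_eval's actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your/causal-lm"  # placeholder; substitute any HF causal LM id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of the log-probabilities of the continuation tokens given the prompt.

    Assumes the tokenization of `prompt` is a prefix of the tokenization of
    `prompt + continuation` (a simplification that holds for typical cases).
    """
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    cont_ids = full_ids[0, n_prompt:]     # token ids of the continuation
    preds = logprobs[0, n_prompt - 1:-1]  # positions that predict those tokens
    return preds[torch.arange(len(cont_ids)), cont_ids].sum().item()

def predict_choice(prompt: str) -> str:
    """Return the answer letter the model finds most likely after the prompt."""
    return max("ABCD", key=lambda c: continuation_logprob(prompt, " " + c))
```

Under this scoring rule, the three evaluations compared here differ only in the prompt string, which is why formatting alone moves RWKV-6-World-7B-v2.1 from 42.8% to 47.9%.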
Using the data format from RWKV's training as the prompt, RWKV-6-World-7B-v2.1 reaches 46.7% on MMLU:
User: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.
A. 0
B. 4
C. 2
D. 6
Assistant: The answer is
Using the prompt template best suited to RWKV model inference, RWKV-6-World-7B-v2.1 reaches 47.9% on MMLU:
User: You are a very talented expert in <SUBJECT>. Answer this question:
<Question>
A. <|A|>
B. <|B|>
C. <|C|>
D. <|D|>
Assistant: The answer is
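A small helper like the one below (an illustrative sketch, not official RWKV tooling; the function name and argument layout are our own) fills that template from an MMLU item so it can be scored with the log-likelihood method sketched earlier:

```python
def build_rwkv_mmlu_prompt(subject: str, question: str, choices: list[str]) -> str:
    """Instantiate the RWKV-friendly MMLU template shown above (sketch)."""
    letters = ["A", "B", "C", "D"]
    lines = [f"User: You are a very talented expert in {subject}. Answer this question:",
             question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Assistant: The answer is")
    return "\n".join(lines)

# Example with the abstract-algebra question used earlier:
print(build_rwkv_mmlu_prompt(
    "abstract algebra",
    "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    ["0", "4", "2", "6"],
))
```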