| Date | Model | Contributors | #Params | Input Length | Score (Avg.) | GvRp | SSFD | QMsm | SQAL | Qspr | Nrtv | QALT | MuSQ | SpDg | BkSS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 05/23 | GPT-4 | ZeroSCROLLS team | - | 8K | 41.67 | 26.3 | 17.3 | 18.5 | 22.6 | 50.7 | 27.6 | 89.2 | 41.1 | 62.8 | 60.5 |
| 05/23 | Claude | ZeroSCROLLS team | - | 8K | 39.07 | 24.2 | 16.1 | 14.6 | 21.0 | 52.3 | 32.6 | 84.8 | 36.1 | 61.6 | 47.4 |
| 07/23 | Llama 2 Long | Meta | 70B | 16K | 37.71 | 26.0 | 15.0 | 20.0 | 20.9 | 52.0 | 31.7 | 82.6 | 27.3 | 55.5 | 46.2 |
| 05/23 | ChatGPT | ZeroSCROLLS team | - | 4K | 34.02 | 21.3 | 16.1 | 15.6 | 20.4 | 49.3 | 25.1 | 66.6 | 27.1 | 49.1 | 49.8 |
| 05/23 | DaVinci003 | ZeroSCROLLS team | - | 4K | 33.74 | 21.7 | 16.1 | 16.9 | 22.0 | 52.7 | 24.6 | 69.0 | 33.5 | 31.3 | 49.5 |
| 05/23 | Flan-UL2 | ZeroSCROLLS team | 20B | 8K | 30.62 | 16.1 | 11.5 | 13.6 | 5.7 | 56.9 | 25.5 | 75.6 | 51.3 | 36.0 | 14.0 |
| 05/23 | Flan-T5 | ZeroSCROLLS team | 11B | 8K | 29.90 | 17.6 | 7.8 | 11.0 | 8.0 | 48.3 | 19.3 | 75.2 | 46.8 | 48.7 | 16.4 |
| 08/23 | Stable Beluga 7B | yuzhenm | | 4K | 23.01 | 13.0 | 13.8 | 14.6 | 17.8 | 28.3 | 15.9 | 51.8 | 18.5 | 46.9 | 9.5 |
| 04/24 | por_ms | 1 | 1 | 1 | 20.45 | 21.8 | 13.6 | 15.5 | 21.0 | 21.2 | 17.7 | 47.8 | 9.3 | 33.4 | 3.4 |
| 11/23 | GPT4 Turbo | Tam Doan | | | 2.40 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10/23 | graph | Tam Doan | | | 2.37 | 23.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10/23 | GPT4 | Tam Doan | | | 2.29 | 22.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 05/23 | Naive | ZeroSCROLLS team | - | - | 19.64 | 22.6 | 6.7 | 6.7 | 10.5 | 6.1 | 2.1 | 26.6 | 20.0 | 45.0 | 50.0 |
| 04/24 | llama2_H2O_final | zwang | | | 19.41 | 15.4 | 13.2 | 14.3 | 18.3 | 20.5 | 15.0 | 43.2 | 9.5 | 40.8 | 3.8 |
| 04/24 | 3-4-open | 1 | 1 | 4k | 19.26 | 22.6 | 13.7 | 15.4 | 21.1 | 24.1 | 17.6 | 43.8 | 8.1 | 24.4 | 1.8 |
| 04/24 | llama2_7B_chat_ours | zwang33 | | | 18.96 | 15.2 | 11.9 | 14.3 | 17.9 | 19.7 | 15.1 | 42.8 | 9.9 | 39.0 | 3.6 |
| 09/23 | Stable Beluga 13B | yuzhenm | 13B | 4K | 16.84 | 6.0 | 7.4 | 12.8 | 13.3 | 20.0 | 13.4 | 47.8 | 25.0 | 14.8 | 7.9 |
| 03/24 | llama2chat_bestbase_1e-4_7top3_bs8_ratio1_gate_v3_3-pretrain-4 | 1 | 1 | 1 | 14.99 | 10.6 | 13.0 | 15.6 | 18.9 | 23.1 | 17.7 | 40.0 | 10.9 | 0.0 | 0.0 |
| 03/24 | 1 | 1 | 1 | 1 | 14.74 | 10.7 | 13.4 | 15.9 | 19.2 | 21.7 | 18.7 | 39.0 | 8.8 | 0.0 | 0.0 |
| 05/23 | T0pp | ZeroSCROLLS team | 11B | 8K | 14.34 | 7.1 | 9.6 | 7.2 | 3.9 | 25.0 | 18.7 | 21.4 | 35.3 | 15.2 | 0.0 |

A downloadable version of the leaderboard, with a full breakdown of results, is also available.

The dataset abbreviations stand for: GovReport, SummScreenFD, QMSum, SQuALITY, Qasper, NarrativeQA, QuALITY, MuSiQue, SpaceDigest, BookSumSort.
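
The Score (Avg.) column appears to be the unweighted mean of the ten task scores. The page does not state this explicitly, so treat it as an assumption; it does match the leaderboard rows up to rounding. A minimal sketch, using the GPT-4 row and a hypothetical helper named `average_score`:

```python
# Task scores copied from the GPT-4 row of the leaderboard, in column order.
gpt4_scores = {
    "GvRp": 26.3, "SSFD": 17.3, "QMsm": 18.5, "SQAL": 22.6, "Qspr": 50.7,
    "Nrtv": 27.6, "QALT": 89.2, "MuSQ": 41.1, "SpDg": 62.8, "BkSS": 60.5,
}


def average_score(task_scores):
    """Unweighted mean over all task scores (assumed aggregation rule)."""
    values = list(task_scores.values())
    return sum(values) / len(values)


avg = average_score(gpt4_scores)
print(f"GPT-4 average from rounded task scores: {avg:.2f}")
```

The result lands within 0.01 of the listed 41.67. The small gap is consistent with the leaderboard averaging unrounded task scores, though that is also an inference. The same rule explains the rows where only GvRp is non-zero: for example, 24.0 / 10 = 2.40.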

Metrics details

  • Summarization tasks (GovReport, SummScreenFD, QMSum and SQuALITY) are scored by the geometric mean of ROUGE-1, ROUGE-2 and ROUGE-L

  • Qasper, NarrativeQA and MuSiQue are scored by F1

  • QuALITY is scored by accuracy

  • SpaceDigest is scored with the exponential similarity as described in the paper

  • BookSumSort is scored by the concordance index
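
To make three of the metrics above concrete, here is a rough sketch of a geometric mean over ROUGE scores, an F1 score, and a concordance index. It is not the official ZeroSCROLLS scoring code. The function names are hypothetical, the ROUGE values are assumed to be precomputed elsewhere, the F1 variant (lowercased, whitespace-tokenized) is an assumption about normalization, and the concordance index is the generic pairwise definition applied to an ordering.

```python
from collections import Counter
from itertools import combinations


def geometric_mean_rouge(rouge1, rouge2, rouge_l):
    """Geometric mean of three precomputed ROUGE scores."""
    return (rouge1 * rouge2 * rouge_l) ** (1.0 / 3.0)


def token_f1(prediction, reference):
    """Token-level F1; lowercasing and whitespace splitting are assumptions."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def concordance_index(predicted_order, true_order):
    """Fraction of item pairs placed in the same relative order as the truth."""
    rank = {item: position for position, item in enumerate(true_order)}
    pairs = list(combinations(predicted_order, 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing can be out of order
    # combinations() yields (a, b) with a before b in the predicted order.
    concordant = sum(1 for a, b in pairs if rank[a] < rank[b])
    return concordant / len(pairs)
```

These functions return values in the range 0 to 1, whereas the leaderboard reports scores on a 0 to 100 scale. Note that the geometric mean drops to zero whenever any one ROUGE variant is zero, which penalizes outputs with no bigram overlap more heavily than an arithmetic mean would.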
