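| Date  | Model             | Contributors     | #Params | Input Length | Score (Avg.) | GvRp | SSFD | QMsm | SQAL | Qspr | Nrtv | QALT | MuSQ | SpDg | BkSS |
|-------|-------------------|------------------|---------|--------------|--------------|------|------|------|------|------|------|------|------|------|------|
| 05/23 | GPT-4             | ZeroSCROLLS team | -       | 8K           | 41.67        | 26.3 | 17.3 | 18.5 | 22.6 | 50.7 | 27.6 | 89.2 | 41.1 | 62.8 | 60.5 |
| 05/23 | Claude            | ZeroSCROLLS team | -       | 8K           | 39.07        | 24.2 | 16.1 | 14.6 | 21.0 | 52.3 | 32.6 | 84.8 | 36.1 | 61.6 | 47.4 |
| 05/23 | ChatGPT           | ZeroSCROLLS team | -       | 4K           | 34.02        | 21.3 | 16.1 | 15.6 | 20.4 | 49.3 | 25.1 | 66.6 | 27.1 | 49.1 | 49.8 |
| 05/23 | DaVinci003        | ZeroSCROLLS team | -       | 4K           | 33.74        | 21.7 | 16.1 | 16.9 | 22.0 | 52.7 | 24.6 | 69.0 | 33.5 | 31.3 | 49.5 |
| 05/23 | Flan-UL2          | ZeroSCROLLS team | 20B     | 8K           | 30.62        | 16.1 | 11.5 | 13.6 | 5.7  | 56.9 | 25.5 | 75.6 | 51.3 | 36.0 | 14.0 |
| 05/23 | Flan-T5           | ZeroSCROLLS team | 11B     | 8K           | 29.90        | 17.6 | 7.8  | 11.0 | 8.0  | 48.3 | 19.3 | 75.2 | 46.8 | 48.7 | 16.4 |
| 08/23 | Stable Beluga 7B  | yuzhenm          |         | 4K           | 23.01        | 13.0 | 13.8 | 14.6 | 17.8 | 28.3 | 15.9 | 51.8 | 18.5 | 46.9 | 9.5  |
| 05/23 | Naive             | ZeroSCROLLS team | -       | -            | 19.64        | 22.6 | 6.7  | 6.7  | 10.5 | 6.1  | 2.1  | 26.6 | 20.0 | 45.0 | 50.0 |
| 09/23 | Stable Beluga 13B | yuzhenm          | 13B     | 4K           | 16.84        | 6.0  | 7.4  | 12.8 | 13.3 | 20.0 | 13.4 | 47.8 | 25.0 | 14.8 | 7.9  |
| 05/23 | T0pp              | ZeroSCROLLS team | 11B     | 8K           | 14.34        | 7.1  | 9.6  | 7.2  | 3.9  | 25.0 | 18.7 | 21.4 | 35.3 | 15.2 | 0.0  |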
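
The Score (Avg.) column appears to be the unweighted mean of the ten per-task scores. The page does not state this explicitly, so treat the aggregation below as an assumption. The function name `average_score` is illustrative, and the values are copied from the GPT-4 and Claude rows. A minimal sketch:

```python
# Per-task scores copied from the GPT-4 and Claude rows of the leaderboard.
gpt4_scores = {
    "GvRp": 26.3, "SSFD": 17.3, "QMsm": 18.5, "SQAL": 22.6, "Qspr": 50.7,
    "Nrtv": 27.6, "QALT": 89.2, "MuSQ": 41.1, "SpDg": 62.8, "BkSS": 60.5,
}
claude_scores = {
    "GvRp": 24.2, "SSFD": 16.1, "QMsm": 14.6, "SQAL": 21.0, "Qspr": 52.3,
    "Nrtv": 32.6, "QALT": 84.8, "MuSQ": 36.1, "SpDg": 61.6, "BkSS": 47.4,
}


def average_score(task_scores):
    """Unweighted mean over all task scores (assumed aggregation)."""
    return sum(task_scores.values()) / len(task_scores)


gpt4_avg = average_score(gpt4_scores)
claude_avg = average_score(claude_scores)
print(round(gpt4_avg, 2), round(claude_avg, 2))
```

Recomputing from the rounded per-task values can differ from the published average in the last digit, presumably because the leaderboard averages unrounded scores.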

Click here for a downloadable version of the leaderboard with a full breakdown of results.
 

The dataset abbreviations stand for: GovReport, SummScreenFD, QMSum, SQuALITY, Qasper, NarrativeQA, QuALITY, MuSiQue, SpaceDigest, BookSumSort.

Metrics details

  • Summarization tasks (GovReport, SummScreenFD, QMSum and SQuALITY) are scored by the geometric mean of ROUGE-1/2/L

  • Qasper, NarrativeQA and MuSiQue are scored by F1

  • QuALITY is scored by accuracy

  • SpaceDigest is scored with the exponential similarity as described in the paper

  • BookSumSort is scored by the concordance index
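
Three of the metrics listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's official scorer. The function names are hypothetical. The token-level F1 uses simple lowercase whitespace tokenization, which may differ from the official normalization. The concordance index is taken here as the fraction of item pairs whose predicted relative order matches the true order. The functions return values in [0, 1] (or on the scale of their inputs), whereas the leaderboard reports percentages.

```python
from collections import Counter
from itertools import combinations


def rouge_geometric_mean(rouge_1, rouge_2, rouge_l):
    """Geometric mean of three precomputed ROUGE scores."""
    return (rouge_1 * rouge_2 * rouge_l) ** (1.0 / 3.0)


def token_f1(prediction, reference):
    """Token-overlap F1 with whitespace tokenization (an assumed normalization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def concordance_index(predicted_order, true_order):
    """Fraction of item pairs placed in the correct relative order."""
    pred_rank = {item: rank for rank, item in enumerate(predicted_order)}
    # combinations() yields (a, b) with a before b in the true order.
    pairs = list(combinations(true_order, 2))
    if not pairs:
        return 1.0
    concordant = sum(1 for a, b in pairs if pred_rank[a] < pred_rank[b])
    return concordant / len(pairs)


f1_example = token_f1("the cat sat", "the cat")
ci_example = concordance_index([2, 1, 3], [1, 2, 3])
```

Under this definition a random ordering scores about 0.5 in expectation, which is consistent with the Naive baseline's BkSS score of 50.0.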
