Date | Model | Contributors | #Params | Input Length | Score (Avg.) | GvRp | SSFD | QMsm | SQAL | Qspr | Nrtv | QALT | MuSQ | SpDg | BkSS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
05/23 | GPT-4 | ZeroSCROLLS team | - | 8K | 41.67 | 26.3 | 17.3 | 18.5 | 22.6 | 50.7 | 27.6 | 89.2 | 41.1 | 62.8 | 60.5 |
05/23 | Claude | ZeroSCROLLS team | - | 8K | 39.07 | 24.2 | 16.1 | 14.6 | 21.0 | 52.3 | 32.6 | 84.8 | 36.1 | 61.6 | 47.4 |
05/23 | ChatGPT | ZeroSCROLLS team | - | 4K | 34.02 | 21.3 | 16.1 | 15.6 | 20.4 | 49.3 | 25.1 | 66.6 | 27.1 | 49.1 | 49.8 |
05/23 | DaVinci003 | ZeroSCROLLS team | - | 4K | 33.74 | 21.7 | 16.1 | 16.9 | 22.0 | 52.7 | 24.6 | 69.0 | 33.5 | 31.3 | 49.5 |
05/23 | Flan-UL2 | ZeroSCROLLS team | 20B | 8K | 30.62 | 16.1 | 11.5 | 13.6 | 5.7 | 56.9 | 25.5 | 75.6 | 51.3 | 36.0 | 14.0 |
05/23 | Flan-T5 | ZeroSCROLLS team | 11B | 8K | 29.90 | 17.6 | 7.8 | 11.0 | 8.0 | 48.3 | 19.3 | 75.2 | 46.8 | 48.7 | 16.4 |
05/23 | Naive | ZeroSCROLLS team | - | - | 19.64 | 22.6 | 6.7 | 6.7 | 10.5 | 6.1 | 2.1 | 26.6 | 20.0 | 45.0 | 50.0 |
05/23 | T0pp | ZeroSCROLLS team | 11B | 8K | 14.34 | 7.1 | 9.6 | 7.2 | 3.9 | 25.0 | 18.7 | 21.4 | 35.3 | 15.2 | 0.0 |
Click here for a downloadable version of the leaderboard with a full breakdown of results.
The dataset abbreviations stand for: GovReport, SummScreenFD, QMSum, SQuALITY, Qasper, NarrativeQA, QuALITY, MuSiQue, SpaceDigest, BookSumSort.
Metrics details
- Summarization tasks (GovReport, SummScreenFD, QMSum and SQuALITY) are scored by the geometric mean of ROUGE-1/2/L.
- Qasper, NarrativeQA and MuSiQue are scored by F1.
- QuALITY is scored by accuracy.
- SpaceDigest is scored by exponential similarity, as described in the paper.
- BookSumSort is scored by the concordance index.
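Two of the metrics above can be sketched in a few lines. This is a minimal illustration, not the benchmark's official scoring code: the function names `geometric_mean` and `concordance_index` are hypothetical, the ROUGE values in the usage example are made up, and the concordance index is assumed here to be the fraction of item pairs placed in the correct relative order.

```python
import math
from itertools import combinations


def geometric_mean(scores):
    """Geometric mean of non-negative scores (e.g. ROUGE-1, ROUGE-2, ROUGE-L)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    # A single zero makes the whole product zero; avoid log(0).
    if any(s == 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def concordance_index(predicted_order):
    """Fraction of pairs in correct relative order (assumed definition).

    `predicted_order` lists the true positions (0..n-1) of the items,
    in the order the model arranged them.
    """
    pairs = list(combinations(range(len(predicted_order)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    concordant = sum(
        1 for i, j in pairs if predicted_order[i] < predicted_order[j]
    )
    return concordant / len(pairs)


# Hypothetical ROUGE-1/2/L values for one summary, on a 0-100 scale.
summary_score = geometric_mean([40.0, 10.0, 20.0])

# A perfectly sorted list of four summaries, and one with a single swap.
perfect = concordance_index([0, 1, 2, 3])
one_swap = concordance_index([0, 2, 1, 3])
```

Under these assumptions the geometric mean penalizes a weak ROUGE-2 more than an arithmetic mean would, and a fully reversed ordering scores 0 on the concordance index.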