Leaderboard | ZeroSCROLLS Benchmark

Date	Model	Contributors	#Params	Input Length	Score (Avg.)	GvRp	SSFD	QMsm	SQAL	Qspr	Nrtv	QALT	MuSQ	SpDg	BkSS
05/23	GPT-4	ZeroSCROLLS team	-	8K	41.67	26.3	17.3	18.5	22.6	50.7	27.6	89.2	41.1	62.8	60.5
05/23	Claude	ZeroSCROLLS team	-	8K	39.07	24.2	16.1	14.6	21.0	52.3	32.6	84.8	36.1	61.6	47.4
07/23	Llama 2 Long	Meta	70B	16K	37.71	26.0	15.0	20.0	20.9	52.0	31.7	82.6	27.3	55.5	46.2
05/23	ChatGPT	ZeroSCROLLS team	-	4K	34.02	21.3	16.1	15.6	20.4	49.3	25.1	66.6	27.1	49.1	49.8
05/23	DaVinci003	ZeroSCROLLS team	-	4K	33.74	21.7	16.1	16.9	22.0	52.7	24.6	69.0	33.5	31.3	49.5
05/23	Flan-UL2	ZeroSCROLLS team	20B	8K	30.62	16.1	11.5	13.6	5.7	56.9	25.5	75.6	51.3	36.0	14.0
05/23	Flan-T5	ZeroSCROLLS team	11B	8K	29.90	17.6	7.8	11.0	8.0	48.3	19.3	75.2	46.8	48.7	16.4
09/24	simpo_llama3	zecheng			29.85	21.1	13.5	16.7	18.1	50.1	24.8	52.2	19.9	46.1	36.1
08/23	Stable Beluga 7B	yuzhenm		4K	23.01	13.0	13.8	14.6	17.8	28.3	15.9	51.8	18.5	46.9	9.5
04/24	por_ms	1	1	1	20.45	21.8	13.6	15.5	21.0	21.2	17.7	47.8	9.3	33.4	3.4
09/24	Llama 3.3-3B	Tam Doan	3B		2.61	26.1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
11/23	GPT4 Turbo	Tam Doan			2.40	24.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
10/23	graph	Tam Doan			2.37	23.7	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
10/23	GPT4	Tam Doan			2.29	22.9	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
10/24	RC_Llama3.2-3B Instruct	Tam Doan			2.23	22.3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
01/25	GPTo	Tam Doan			2.20	22.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
02/25	RC_GPT4o	Tam Doan			2.11	21.1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
05/23	Naive	ZeroSCROLLS team	-	-	19.64	22.6	6.7	6.7	10.5	6.1	2.1	26.6	20.0	45.0	50.0
04/24	llama2_H2O_final	zwang			19.41	15.4	13.2	14.3	18.3	20.5	15.0	43.2	9.5	40.8	3.8
04/24	3-4-open	1	1	4k	19.26	22.6	13.7	15.4	21.1	24.1	17.6	43.8	8.1	24.4	1.8
04/24	llama2_7B_chat_ours	zwang33			18.96	15.2	11.9	14.3	17.9	19.7	15.1	42.8	9.9	39.0	3.6
09/23	Stable Beluga 13B	yuzhenm	13B	4K	16.84	6.0	7.4	12.8	13.3	20.0	13.4	47.8	25.0	14.8	7.9
08/24	llama2-7b-chat	1	1	4k	15.47	16.4	7.3	12.0	15.7	14.1	10.3	22.2	10.9	44.0	1.7
03/24	llama2chat_bestbase_1e-4_7top3_bs8_ratio1_gate_v3_3-pretrain-4	1	1	1	14.99	10.6	13.0	15.6	18.9	23.1	17.7	40.0	10.9	0.0	0.0
03/24	1	1	1	1	14.74	10.7	13.4	15.9	19.2	21.7	18.7	39.0	8.8	0.0	0.0
05/23	T0pp	ZeroSCROLLS team	11B	8K	14.34	7.1	9.6	7.2	3.9	25.0	18.7	21.4	35.3	15.2	0.0

A downloadable version of the leaderboard, with a full breakdown of results, is also available.

The dataset abbreviations stand for: GovReport, SummScreenFD, QMSum, SQuALITY, Qasper, NarrativeQA, QuALITY, MuSiQue, SpaceDigest, BookSumSort.
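
The Score (Avg.) column appears to be the arithmetic mean of the ten per-dataset scores; this is inferred from the table rather than stated on the page. A minimal sketch under that assumption, where the function name `average_score` is hypothetical and the input values are copied from the GPT-4 and Llama 3.3-3B rows:

```python
def average_score(task_scores):
    """Arithmetic mean of the per-dataset scores (assumed aggregation)."""
    if not task_scores:
        raise ValueError("at least one task score is required")
    return sum(task_scores) / len(task_scores)


# Per-dataset scores in column order: GvRp, SSFD, QMsm, SQAL, Qspr,
# Nrtv, QALT, MuSQ, SpDg, BkSS.
gpt4_row = [26.3, 17.3, 18.5, 22.6, 50.7, 27.6, 89.2, 41.1, 62.8, 60.5]
llama33_row = [26.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

gpt4_avg = average_score(gpt4_row)
llama33_avg = average_score(llama33_row)
```

Under this reading, a submission that reports only GovReport (with 0.0 for every other dataset) gets an average of one tenth of its GovReport score, which matches the rows with a Score (Avg.) near 2. The small gap between the value recomputed for GPT-4 and the listed 41.67 is consistent with the listed average being computed from unrounded per-dataset scores.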

Metrics details

Summarization tasks (GovReport, SummScreenFD, QMSum and SQuALITY) are scored by the geometric mean of ROUGE-1, ROUGE-2 and ROUGE-L.
Qasper, NarrativeQA and MuSiQue are scored by F1.
QuALITY is scored by accuracy.
SpaceDigest is scored by exponential similarity, as described in the paper.
BookSumSort is scored by the concordance index.
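
Two of the metrics above can be illustrated with a short sketch. It is not the benchmark's official scorer. The function names are hypothetical, the ROUGE values are assumed to be computed elsewhere, and the F1 shown is a simplified token-overlap variant that uses lowercasing and whitespace splitting; the official evaluation may normalize text differently.

```python
from collections import Counter


def rouge_geometric_mean(rouge_1, rouge_2, rouge_l):
    """Geometric mean of three already-computed ROUGE scores."""
    return (rouge_1 * rouge_2 * rouge_l) ** (1.0 / 3.0)


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


summary_score = rouge_geometric_mean(8.0, 8.0, 8.0)
answer_score = token_f1("the cat sat", "the cat")
```

Because the geometric mean multiplies the three ROUGE values, a summary that scores zero on any one of them scores zero overall, unlike an arithmetic mean.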