Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding
We evaluated more than 20 Transformer models for ranking long documents (including recent LongP models trained with FlashAttention) and compared them with a simple FirstP baseline, which applies the same model to a truncated input (at most 512 tokens). We used MS MARCO Documents v1 as the primary training set and evaluated models in both zero-shot transfer and fine-tuned settings.
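The FirstP baseline is simple to reproduce. Below is a minimal sketch (not the paper's own code) of scoring a query-document pair with a cross-encoder while truncating the input to at most 512 tokens; the checkpoint name and the firstp_score helper are illustrative assumptions, not artifacts from the paper.

```python
# Minimal FirstP-style scoring sketch (assumed implementation, not the
# paper's code): a cross-encoder scores the query paired with only the
# first ~512 tokens of the document; the rest is discarded by truncation.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def firstp_score(query: str, document: str) -> float:
    # Tokenize the (query, document) pair; "only_second" truncates the
    # document (not the query) down to the 512-token budget.
    inputs = tokenizer(
        query,
        document,
        truncation="only_second",
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    # Single relevance logit per pair for this model family.
    return logits.squeeze().item()
```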
On MS MARCO, TREC DLs, and Robust04, no long-document model outperformed FirstP by more than 5% in NDCG and MRR (when average…