Document Reordering for Efficient Dynamic Pruning in Traditional and Learned Sparse Retrieval


Yafay E., ALTINGÖVDE İ. S.

IEEE Access, cilt.14, ss.70175-70187, 2026 (SCI-Expanded, Scopus) identifier

  • Yayın Türü: Makale / Tam Makale
  • Cilt numarası: 14
  • Basım Tarihi: 2026
  • Doi Numarası: 10.1109/access.2026.3690983
  • Dergi Adı: IEEE Access
  • Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Sayfa Sayıları: ss.70175-70187
  • Anahtar Kelimeler: dynamic pruning, Information retrieval, inverted index, neural retrieval, search engines
  • Orta Doğu Teknik Üniversitesi Adresli: Evet

Özet

Large scale Web search engines conduct first-stage retrieval using a traditional inverted index, on which they are expected to execute billions of queries on a very tight budget, i.e., less than 200 milliseconds. During query processing, the postings lists corresponding to each query term are traversed in parallel, i.e., applying document-at-a-time (DaaT) strategy, and a min-heap of top-k documents is employed to store the top-ranked documents. Dynamic pruning algorithms accelerate DaaT query processing by skipping documents that cannot enter the top-k results. Their efficiency, however, depends on how quickly the heap threshold increases during scoring. This paper explores an alternative route to faster pruning: rather than estimating the threshold in advance, we reorder documents within postings lists based on their frequency, position or score in past query results, allowing the threshold to rise more rapidly. We show that the tailored Hit Count (HC) reordering - derived from document frequency in historical query logs - achieves significant speedups for both traditional and learned sparse models. Specifically, HC provides up to a 1.12× speedup for BM25-based scoring on ClueWeb09B and SPLADE-based scoring on MS MARCO, with only a modest increase in index size. This demonstrates that our lightweight document reordering approach yields consistent efficiency gains in both traditional and neural information retrieval settings.