DocSpider: a dataset of cross-domain natural language querying for MongoDB


ÖZER A. G., ÇEKİNEL R. F., TOROSLU İ. H., KARAGÖZ P.

NATURAL LANGUAGE PROCESSING, vol.31, no.6, pp.1367-1398, 2025 (SCI-Expanded, AHCI, SSCI, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 31 Issue: 6
  • Publication Date: 2025
  • DOI: 10.1017/nlp.2024.63
  • Journal Name: NATURAL LANGUAGE PROCESSING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Arts and Humanities Citation Index (AHCI), Social Sciences Citation Index (SSCI), Scopus
  • Page Numbers: pp.1367-1398
  • Middle East Technical University Affiliated: Yes

Abstract

Natural language querying allows users to formulate questions in natural language without requiring specific knowledge of the database query language. Large language models have been very successful in addressing the text-to-SQL problem, that is, translating questions given in textual form into SQL statements. Document-oriented NoSQL databases are gaining popularity in the era of big data due to their ability to handle vast amounts of semi-structured data and provide advanced querying functionalities. However, studies on text-to-NoSQL systems, particularly those targeting document databases, are very scarce. In this study, we utilize large language models to create DocSpider, a cross-domain dataset that pairs natural language questions with document database queries, leveraging the well-known text-to-SQL challenge dataset Spider. As the document database, we use MongoDB. Furthermore, we conduct experiments to assess the effectiveness of fine-tuning a text-to-NoSQL model on the DocSpider dataset against a cross-language transfer learning approach, SQL-to-NoSQL, and zero-shot instruction prompting. The experimental results reveal a significant improvement in the execution accuracy of fine-tuned language models when utilizing the DocSpider dataset.
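
To make the task concrete, the following is a minimal Python sketch of what a text-to-NoSQL example and an execution-accuracy check might look like. It is an illustration under stated assumptions, not the authors' implementation: the record layout, the field names (question, sql, mongo_pipeline), the sample question, and the order-insensitive comparison of result documents are all hypothetical and are not taken from the DocSpider release or the paper's evaluation protocol.

```python
from collections import Counter
import json

# Hypothetical example record; the field names and the question are
# illustrative assumptions, not the actual DocSpider schema.
example = {
    "question": "How many singers are older than 30?",
    "sql": "SELECT COUNT(*) FROM singer WHERE age > 30",
    # A MongoDB aggregation pipeline expressing the same request
    # against an assumed `singer` collection.
    "mongo_pipeline": [
        {"$match": {"age": {"$gt": 30}}},
        {"$count": "count"},
    ],
}


def _canonical(rows):
    """Turn a list of result documents into an order-insensitive multiset."""
    return Counter(json.dumps(row, sort_keys=True, default=str) for row in rows)


def execution_accuracy(predicted_results, gold_results):
    """Fraction of examples whose predicted result set equals the gold one.

    Assumption: two result sets match when they contain the same documents,
    regardless of order. The paper may define the metric differently.
    """
    if len(predicted_results) != len(gold_results):
        raise ValueError("predicted and gold lists must have the same length")
    if not gold_results:
        return 0.0
    hits = sum(
        _canonical(pred) == _canonical(gold)
        for pred, gold in zip(predicted_results, gold_results)
    )
    return hits / len(gold_results)
```

In this sketch, each element of predicted_results and gold_results is the list of documents returned by executing one predicted or reference query, so the metric rewards queries that return the right data even when their text differs from the reference.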