DocSpider: a dataset of cross-domain natural language querying for MongoDB

ÖZER, ARİF; ÇEKİNEL, RECEP; TOROSLU, İSMAİL; KARAGÖZ, PINAR

doi:10.1017/nlp.2024.63

ÖZER A. G., ÇEKİNEL R. F., TOROSLU İ. H., KARAGÖZ P.

NATURAL LANGUAGE PROCESSING, vol.31, no.6, pp.1367-1398, 2025 (SCI-Expanded, AHCI, SSCI, Scopus)

Publication Type: Article / Full Article
Volume: 31 Issue: 6
Publication Date: 2025
DOI: 10.1017/nlp.2024.63
Journal Name: NATURAL LANGUAGE PROCESSING
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Arts and Humanities Citation Index (AHCI), Social Sciences Citation Index (SSCI), Scopus
Pages: pp.1367-1398
Affiliated with Middle East Technical University: Yes

Abstract

Natural language querying allows users to formulate questions in natural language without requiring specific knowledge of the database query language. Large language models have been very successful in addressing the text-to-SQL problem, that is, translating questions given in textual form into SQL statements. Document-oriented NoSQL databases are gaining popularity in the era of big data due to their ability to handle vast amounts of semi-structured data and provide advanced querying functionalities. However, studies on text-to-NoSQL systems, particularly those targeting document databases, are very scarce. In this study, we utilize large language models to create DocSpider, a cross-domain dataset that maps natural language questions to document database queries, leveraging the well-known text-to-SQL challenge dataset Spider. As the document database, we use MongoDB. Furthermore, we conduct experiments to assess the effectiveness of the DocSpider dataset to fine-tune a text-to-NoSQL model against a cross-language transfer learning approach, SQL-to-NoSQL, and zero-shot instruction prompting. The experimental results reveal a significant improvement in the execution accuracy of fine-tuned language models when utilizing the DocSpider dataset.