Querying Beyond Keywords: Translating Natural Language to Elasticsearch DSL with Vector Search Support


Ozdemir A. Y., KARAGÖZ P., TOROSLU İ. H.

IEEE Access, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI Number: 10.1109/access.2026.3701709
  • Journal Name: IEEE Access
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Keywords: Elasticsearch, Hybrid search, Large language models, Natural language querying, Query translation, Semantic search
  • Affiliated with Middle East Technical University: Yes

Abstract

Despite Elasticsearch’s popularity in modern search pipelines, translating natural language queries (NLQs) into its domain-specific language (DSL) remains challenging, especially with the integration of semantic vector search. Existing benchmarks primarily focus on keyword-based query translation, overlooking the increasingly essential hybrid scenarios where keyword filters are combined with vector-based similarity retrieval. This paper introduces the first benchmark designed explicitly for evaluating NLQ-to-Elasticsearch DSL translation in hybrid search contexts. Leveraging a carefully curated subset of the WikiSQL dataset enriched with related Wikipedia text embeddings, our benchmark supports structured queries integrated with dense semantic similarity conditions. Through this approach, we provide a structured and controlled evaluation framework capable of assessing not only the syntactic accuracy of generated Elasticsearch DSL queries but also the semantic reasoning capabilities of large language models (LLMs). Our experiments measure the zero-shot and few-shot performance of state-of-the-art LLMs on hybrid query generation tasks, illuminating their strengths and limitations in capturing complex user intents across keyword-based and vector similarity dimensions. In addition to prompting-based evaluation, we investigate the impact of parameter-efficient fine-tuning on a representative open-source model to assess its potential for narrowing the performance gap with proprietary systems. Two central conclusions emerge from our analysis. First, GPT models generally outperform open-source systems in absolute accuracy and reliability under pure prompting settings. Second, certain open-source models, particularly Qwen2.5-Coder variants, demonstrate substantial few-shot gains and, with parameter-efficient fine-tuning, can significantly reduce this gap, matching or even surpassing GPT-5 under zero-shot prompting on specific query categories.
Overall, the proposed benchmark highlights both the current limitations and the untapped potential of open models in hybrid NLQ-to-DSL translation, advancing research toward more expressive and semantically grounded natural-language search interfaces.
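
To make the notion of a hybrid query concrete, the following is a minimal sketch of the kind of Elasticsearch request body an NLQ-to-DSL translator would need to produce: exact-match keyword filters combined with a k-nearest-neighbor clause over a dense vector field. It is an illustration under stated assumptions, not the benchmark's own schema or the authors' code. The helper `build_hybrid_query`, the field names `text_embedding` and `nationality`, and the tiny example vector are all hypothetical; the request shape follows the top-level `knn` search option available in Elasticsearch 8.x.

```python
import json


def build_hybrid_query(keyword_filters, query_vector, k=5, num_candidates=50,
                       vector_field="text_embedding"):
    """Build a search body that combines term filters with kNN vector search.

    All field names here are hypothetical placeholders, not taken from the paper.
    """
    if num_candidates < k:
        # Elasticsearch requires num_candidates to be at least k.
        raise ValueError("num_candidates must be greater than or equal to k")

    # One exact-match `term` clause per structured (keyword) condition.
    filter_clauses = [
        {"term": {field: value}} for field, value in keyword_filters.items()
    ]

    return {
        "knn": {
            "field": vector_field,              # dense_vector field to search
            "query_vector": list(query_vector),  # embedding of the semantic part of the NLQ
            "k": k,                             # number of nearest neighbors to return
            "num_candidates": num_candidates,   # candidates examined per shard
            # Restrict the vector search to documents passing the keyword filters.
            "filter": {"bool": {"filter": filter_clauses}},
        },
        "size": k,
    }


# Hypothetical NLQ: "players from Canada whose biography is about goalkeeping".
# The structured part becomes a term filter; the semantic part becomes the vector.
body = build_hybrid_query({"nationality": "Canada"}, [0.12, -0.03, 0.88])
print(json.dumps(body, indent=2))
```

Placing the keyword conditions inside the `knn` clause's `filter` means they are applied during the vector search rather than after it, so the request can still return up to `k` matching documents. A real translation would use a query vector produced by an embedding model whose dimensionality matches the index mapping.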