Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence


Işık İ., CİNBİŞ R. G., Gol E. A.

42nd International Conference on Machine Learning, ICML 2025, Vancouver, Canada, 13 - 19 July 2025, vol. 267, pp. 26523-26541, (Full Text Paper)

  • Publication Type: Conference Paper / Full Text Paper
  • Volume: 267
  • City of Publication: Vancouver
  • Country of Publication: Canada
  • Page Numbers: pp. 26523-26541
  • Middle East Technical University Affiliated: Yes

Abstract

Language models lack the notion of interchangeable tokens: symbols that are semantically equivalent yet distinct, such as bound variables in formal logic. This limitation prevents generalization to larger vocabularies and hinders the model’s ability to recognize alpha-equivalence, where renaming bound variables preserves meaning. We formalize this machine learning problem and introduce alpha-covariance, a metric for evaluating robustness to such transformations. To tackle this task, we propose a dual-part token embedding strategy: a shared component ensures semantic consistency, while a randomized component maintains token distinguishability. Compared to a baseline that relies on alpha-renaming for data augmentation, our approach demonstrates improved generalization to unseen tokens in linear temporal logic solving, propositional logic assignment prediction, and copying with an extendable vocabulary, while introducing a favorable inductive bias for alpha-equivalence. Our findings establish a foundation for designing language models that can learn interchangeable token representations, a crucial step toward more flexible and systematic reasoning in formal domains. Our code and project page are available at necrashter.github.io/interchangeable-token-embeddings
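
The following Python (PyTorch) snippet is a minimal sketch of the dual-part embedding idea described in the abstract, not the paper's exact implementation. The class name, the id layout (ordinary tokens first, then interchangeable ones), the even split of the embedding into two halves, and the choice to resample the random half on every forward pass are all illustrative assumptions; see the paper and project page for the actual method.

import torch
import torch.nn as nn


class DualPartEmbedding(nn.Module):
    """Sketch: dual-part embeddings for interchangeable tokens.

    Assumed layout: ids in [0, num_regular) are ordinary tokens with fully
    learned embeddings; ids in [num_regular, num_regular + num_interchangeable)
    are interchangeable. Each interchangeable embedding concatenates a single
    shared, learned half (semantic consistency) with a per-token random half
    resampled on every forward pass (distinguishability), so the model cannot
    memorize a fixed vector for any particular symbol.
    """

    def __init__(self, num_regular: int, num_interchangeable: int, dim: int):
        super().__init__()
        assert dim % 2 == 0, "dim is split evenly between the two parts"
        self.num_interchangeable = num_interchangeable
        self.regular = nn.Embedding(num_regular, dim)
        self.shared = nn.Parameter(torch.randn(dim // 2))

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        half = self.shared.shape[0]
        # Fresh random identity halves: one per interchangeable token id.
        rand_halves = torch.randn(self.num_interchangeable, half, device=ids.device)
        shared_halves = self.shared.expand(self.num_interchangeable, half)
        inter = torch.cat([shared_halves, rand_halves], dim=-1)
        # Full lookup table: learned rows, then dual-part rows.
        table = torch.cat([self.regular.weight, inter], dim=0)
        return table[ids]  # (batch, seq, dim)


# Usage: 100 ordinary tokens plus 26 interchangeable variable symbols.
emb = DualPartEmbedding(num_regular=100, num_interchangeable=26, dim=64)
vectors = emb(torch.randint(0, 126, (2, 10)))  # shape (2, 10, 64)

Concatenation keeps the shared semantic part and the random identity part linearly separable; summing the two components in a full-width space would be an equally plausible variant of the same idea.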