Landmark based guidance for reinforcement learning agents under partial observability

Demir, Alper; Çilden, Erkin; Polat, Faruk

doi:10.1007/s13042-022-01713-5

Demir A., Çilden E., Polat F.

International Journal of Machine Learning and Cybernetics, vol. 14, no. 4, pp. 1543-1563, 2023 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 14 Issue: 4
Publication Date: 2023
DOI Number: 10.1007/s13042-022-01713-5
Journal Name: International Journal of Machine Learning and Cybernetics
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC
Page Numbers: pp. 1543-1563
Keywords: Diverse density, Landmark based guidance, Partial observability, Reinforcement learning, Temporal abstraction, Framework
Middle East Technical University Affiliated: Yes

Abstract

© 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

Under partial observability, a reinforcement learning agent needs to estimate its true state using solely the semantics of its observations. However, this interpretation has a drawback called perceptual aliasing, in which distinct states produce the same observation, and which voids the convergence guarantee of the learning algorithm. To overcome this issue, state estimates are formed from the recent experiences of the agent, which can be formulated as a form of memory. Although the state estimates may still yield ambiguous action mappings due to aliasing, some estimates naturally disambiguate the present situation of the agent in the domain. This paper introduces an algorithm that incorporates a guidance mechanism to accelerate reinforcement learning for partially observable problems with hidden states. The algorithm makes use of the landmarks of the problem, namely the distinctive and reliable experiences, in the context of state estimates, within an ambiguous environment. The proposed algorithm constructs an abstract transition model from the observed landmarks, calculates their potentials throughout learning (a mechanism borrowed from reward shaping), and concurrently applies the potentials to provide guiding rewards for the agent. Additionally, we employ a known multiple instance learning method, diverse density, to discover landmarks automatically before learning, and combine both algorithms into a unified framework. The effectiveness of the algorithms is shown empirically through extensive experimentation. The results show that the proposed framework not only accelerates the underlying reinforcement learning methods, but also finds better policies for representative benchmark problems.
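
The guidance mechanism described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes landmarks are already known, builds a count-based abstract model of landmark-to-landmark transitions, estimates a potential per landmark with a simple value-iteration sweep (taking the best observed successor is a simplifying assumption), and applies the standard potential-based shaping term F = γΦ(l') − Φ(l). The class name `LandmarkShaper`, its methods, and the landmark labels are all hypothetical.

```python
from collections import defaultdict


class LandmarkShaper:
    """Hypothetical helper: landmark potentials used as shaping rewards."""

    def __init__(self, gamma=0.9):
        self.gamma = gamma
        # Abstract transition model: visit counts and summed rewards
        # for each observed (landmark -> next landmark) pair.
        self.counts = defaultdict(lambda: defaultdict(int))
        self.reward_sum = defaultdict(lambda: defaultdict(float))
        # Potential of each landmark; unseen landmarks default to 0.
        self.potential = defaultdict(float)

    def record(self, src, dst, reward):
        """Store one observed transition between two landmarks."""
        self.counts[src][dst] += 1
        self.reward_sum[src][dst] += reward

    def update_potentials(self, sweeps=50):
        """Value-iteration sweeps over the abstract model (assumed scheme)."""
        for _ in range(sweeps):
            for src, successors in self.counts.items():
                self.potential[src] = max(
                    self.reward_sum[src][dst] / n
                    + self.gamma * self.potential[dst]
                    for dst, n in successors.items()
                )

    def shaping_reward(self, src, dst):
        """Potential-based shaping term: gamma * phi(dst) - phi(src)."""
        return self.gamma * self.potential[dst] - self.potential[src]


# Toy chain of landmarks: A -> B -> G, with reward only on reaching G.
shaper = LandmarkShaper(gamma=0.9)
shaper.record("A", "B", 0.0)
shaper.record("B", "G", 1.0)
shaper.update_potentials()

forward = shaper.shaping_reward("A", "B")   # moving toward the goal
backward = shaper.shaping_reward("B", "A")  # moving away from the goal
print(forward, backward)
```

In this toy chain, moving backward from B to A receives a lower shaping term than moving forward from A to B, which is the sense in which the potentials guide the agent. In a learning loop, the shaping term would be added to the environment reward whenever the agent passes from one landmark to another.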