IEEE Transactions on Intelligent Vehicles, vol. 10, no. 9, pp. 4397-4408, 2025 (SCI-Expanded, Scopus)
Recent progress in autonomous driving relies primarily on sensor-rich systems, encompassing radars, LiDARs, and advanced cameras, to perceive the environment. Human drivers, however, demonstrate an impressive ability to drive based solely on visual perception. This study introduces an end-to-end method for predicting the steering angle and vehicle speed exclusively from a monocular camera image. Alongside the color image, which conveys scene texture and appearance details, a monocular depth image and a semantic segmentation image are internally derived and incorporated, providing spatial and semantic structure of the environment and yielding three input images in total. In addition, LSTM units are employed to capture temporal features. The proposed model improves RMSE over the state of the art by 44.96% for the steering angle and 4.39% for the speed on the Udacity dataset. Furthermore, tests on the CARLA and Sully Chen datasets yield results that outperform those reported in the literature. Extensive ablation studies are also conducted to demonstrate the contribution of each component. These findings highlight the potential of self-driving systems that rely on visual input alone.
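To make the described pipeline concrete, the following is a minimal PyTorch sketch of a three-stream architecture with an LSTM head, written purely from the abstract's description. It is not the authors' implementation: the `depth_head` and `seg_head` modules are hypothetical stand-ins for the monocular depth and semantic segmentation networks the paper derives internally, and the encoder sizes, class count, and output ordering are illustrative assumptions.

```python
# Sketch (not the paper's code): RGB frame -> internally derived depth and
# segmentation maps -> three CNN encoders -> LSTM over the frame sequence ->
# (steering angle, speed). All module choices here are assumptions.
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """Small CNN mapping one input image (RGB, depth, or segmentation) to a feature vector."""
    def __init__(self, in_ch: int, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 24, 5, stride=2), nn.ELU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ELU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ELU(),
            nn.Conv2d(48, 64, 3, stride=2), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ELU(),
        )

    def forward(self, x):
        return self.net(x)


class EndToEndDriver(nn.Module):
    """RGB + derived depth + derived segmentation, fused and fed to an LSTM."""
    def __init__(self, feat_dim: int = 128, hidden: int = 256, num_classes: int = 19):
        super().__init__()
        # Hypothetical stand-ins for the internally derived modalities; the paper
        # would use dedicated monocular depth / semantic segmentation networks here.
        self.depth_head = nn.Conv2d(3, 1, 3, padding=1)           # pseudo depth map
        self.seg_head = nn.Conv2d(3, num_classes, 3, padding=1)   # pseudo class scores
        self.enc_rgb = StreamEncoder(3, feat_dim)
        self.enc_depth = StreamEncoder(1, feat_dim)
        self.enc_seg = StreamEncoder(num_classes, feat_dim)
        self.lstm = nn.LSTM(3 * feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)                            # [steering angle, speed]

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        x = frames.reshape(b * t, c, h, w)
        depth = self.depth_head(x)                 # derive depth internally
        seg = self.seg_head(x)                     # derive segmentation internally
        feats = torch.cat(
            [self.enc_rgb(x), self.enc_depth(depth), self.enc_seg(seg)], dim=1
        )
        seq, _ = self.lstm(feats.reshape(b, t, -1))
        return self.out(seq[:, -1])                # predict from the last time step


# Example: a batch of 2 clips, 5 frames each, 66x200 images (sizes are illustrative).
model = EndToEndDriver()
pred = model(torch.randn(2, 5, 3, 66, 200))
print(pred.shape)  # torch.Size([2, 2])
```

The key design point the sketch illustrates is that the depth and segmentation images are produced from the same monocular frame rather than from extra sensors, so the fused representation remains vision-only while the LSTM supplies the temporal context.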