Existing methods for egocentric activity recognition are mostly based on extracting motion characteristics from videos. However, the ubiquity of wearable sensors allows information to be acquired from many different sources. Although this growing sensor diversity calls for adaptive fusion, most studies use predetermined weights for each source. In addition, only a limited number of studies make use of optical, audio, and wearable sensors together. In this work, we propose a new framework that adaptively weights the visual, audio, and sensor features according to their discriminative abilities. To this end, multi-kernel learning (MKL) is used to fuse the multi-modal features, so that feature and kernel selection/weighting and recognition are performed concurrently. Audio-visual information is used together with data acquired from wearable sensors, since these sources capture different aspects of activities and help build better models. The proposed framework can be used with different modalities to improve recognition accuracy and can easily be extended with additional sensors. The results show that using multi-modal features with MKL outperforms existing methods.
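To make the idea of weighting modalities by their discriminative ability concrete, the following is a minimal sketch of kernel-level fusion. It is not the framework's actual MKL solver: the weights here come from a simple kernel-target alignment heuristic, computed once rather than learned jointly with the classifier. The function names, the RBF kernel choice, the `gamma` value, and the synthetic "visual" and "audio" features are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gram matrix of the RBF kernel exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def alignment_weights(kernels, y):
    """Heuristic per-kernel weights from kernel-target alignment.

    y holds binary labels in {-1, +1}. Each kernel is scored by its
    normalized Frobenius inner product with the ideal kernel y y^T;
    negative scores are clipped to zero and the rest normalized to sum to 1.
    """
    target = np.outer(y, y)
    scores = []
    for K in kernels:
        score = np.sum(K * target) / (np.linalg.norm(K) * np.linalg.norm(target))
        scores.append(max(score, 0.0))
    scores = np.asarray(scores)
    total = scores.sum()
    if total == 0.0:
        # No kernel aligns with the labels: fall back to uniform weights.
        return np.full(len(kernels), 1.0 / len(kernels))
    return scores / total


def fuse_kernels(kernels, weights):
    """Convex combination of Gram matrices, one per modality."""
    return sum(w * K for w, K in zip(weights, kernels))


# Synthetic demo (assumed data): one informative modality, one pure noise.
rng = np.random.default_rng(0)
n_per_class, dim = 20, 5
y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])

# Hypothetical "visual" features: class means at +2 and -2, small spread.
visual = y[:, None] * 2.0 + 0.5 * rng.standard_normal((2 * n_per_class, dim))
# Hypothetical "audio" features: unrelated to the labels.
audio = rng.standard_normal((2 * n_per_class, dim))

kernels = [rbf_kernel(visual, gamma=0.1), rbf_kernel(audio, gamma=0.1)]
weights = alignment_weights(kernels, y)
K_fused = fuse_kernels(kernels, weights)
```

In this toy setting the informative modality receives most of the weight. The fused Gram matrix could then be passed to any classifier that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. A full MKL formulation would instead optimize the kernel weights and the classifier together.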