Improving the performance of Hadoop/Hive by sharing scan and computation tasks


Tezin Türü: Yüksek Lisans

Tezin Yürütüldüğü Kurum: Orta Doğu Teknik Üniversitesi, Fen Bilimleri Enstitüsü, Fen Bilimleri Enstitüsü, Türkiye

Tezin Onay Tarihi: 2013

Öğrenci: SERKAN ÖZAL

Danışman: AHMET COŞAR

Özet:

MapReduce is a popular model of executing time-consuming analytical queries as a batch of tasks on large scale data. During simultaneous execution of multiple queries, many oppor- tunities can arise for sharing scan and/or computation tasks. Executing common tasks only once can reduce the total execution time of all queries remarkably. Therefore, we propose to use Multiple Query Optimization (MQO) techniques to improve the overall performance of Hadoop Hive, an open source SQL-based distributed warehouse system based on MapReduce. Our framework, SharedHive, transforms a set of correlated HiveQL queries into new global queries that can produce the same results in remarkably smaller total execution times. It is ex- perimentally shown that SharedHive outperforms the conventional Hive by %20-90 reduction, depending on the number of queries and percentage of shared tasks, in the total execution time of correlated TPC-H queries.