An Effective Straggler tolerant Scheme in Big Data Processing Systems using Machine Learning
Shyam Deshmukh1, K. Thirupathi Rao2, B. Thirumala Rao3, Vaibhav Pawar4

1Shyam Deshmukh, Research Scholar, KL Deemed to be University, (Andhra Pradesh), India.
2K. Thirupathi Rao, KL Deemed to be University, (Andhra Pradesh), India.
3B. Thirumala Rao, KL Deemed to be University, (Andhra Pradesh), India.
4Vaibhav Pawar, SP Pune University, (Maharashtra), India.
Manuscript received on 25 March 2019 | Revised Manuscript received on 06 April 2019 | Manuscript Published on 18 April 2019 | PP: 758-762 | Volume-7 Issue-6S March 2019 | Retrieval Number: F02148376S19/2019©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Major data storage vendors consistently prioritize faster job completion and efficient resource utilization in cloud environments. Slow-running or poorly performing cluster nodes remain a major obstacle to fast job execution in the cloud. Various existing mitigation techniques, which avoid these slow nodes (stragglers) while trying to optimize resource utilization and response time, are discussed along with their limitations. This paper aims to build a blacklisting-enabled, machine learning-based straggler-tolerant technique on the Apache Spark framework that identifies stragglers in a cluster. The scheme acts as a decision support system for the scheduler: it predicts straggler nodes and avoids assigning tasks to them, regardless of whether the cause of straggling is internal or external. A decision tree is constructed from job utilization and execution time metrics. Experiments were carried out with the default Apache Spark scheduler, the blacklisting-enabled Apache Spark scheduler, and the blacklisting-enabled machine learning-based scheduler, using input workloads such as Word Count and TeraSort. The results show that our approach reduces job completion time by 19% and improves resource utilization in the cloud environment.
Keywords: Apache Spark; Cloud Environment; Distributed Systems; Machine Learning; Straggler.
Scope of the Article: Machine Learning