Breast Cancer Prognosis with Apache Spark Random Forest Pipeline
Timmana Hari Krishna1, C. Rajabhushanam2

1Timmana Hari Krishna, Department of Computer Science and Engineering, Bharath Institute of Higher Education and Research, Chennai (Tamil Nadu), India.
2Dr. C. Rajabhushanam, Department of Computer Science and Engineering, Bharath Institute of Higher Education and Research, Chennai (Tamil Nadu), India.
Manuscript received on 21 May 2019 | Revised Manuscript received on 07 June 2019 | Manuscript Published on 15 June 2019 | PP: 275-277 | Volume-8 Issue-1S2 May 2019 | Retrieval Number: A00630581S219/2019©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Breast cancer is one of the most common cancers diagnosed in women in Western countries. Research and awareness support improvements in its diagnosis and treatment, and early detection improves survival rates and reduces the number of deaths from the disease. Computing techniques have recently spread across all domains, including medicine and healthcare. Data science and machine learning techniques are used in cancer prediction and analysis to obtain rapid, accurate results. Cancer prediction involves identifying malignant cells among breast cells. Researchers and pathologists have applied several machine learning algorithms to cancer prediction, including K-Nearest Neighbors, logistic regression, support vector machines, artificial neural networks and decision trees, but have not settled on a single practical method. In this paper we propose a scalable, fault-tolerant pipeline model that analyses large-scale cancer data and predicts cancerous cells in real time. The model is developed on Apache Spark using its Machine Learning Pipeline. We implement the pipeline with the Random Forest algorithm and compare it with a baseline model in terms of accuracy and performance.
Keywords: Apache Spark, Machine Learning Pipeline, Cancer Prediction, Random Forests.
Scope of the Article: Machine Learning
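
The abstract names Random Forest as the classifier and accuracy as one of the comparison criteria. As a rough, framework-free illustration of that comparison (not the authors' Apache Spark implementation), the sketch below trains a small forest of one-split trees on bootstrap samples, each tree using one randomly chosen feature, and compares its held-out accuracy with a majority-class baseline. This is a simplified stand-in for a full Random Forest. The data values, the function names (fit_stump, fit_forest, predict), the parameters n_trees and seed, the choice of baseline, and the assumption that higher feature scores indicate malignancy are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical toy data: (feature scores, label), where label 1 = malignant.
# Assumption: higher scores indicate malignancy. All values are invented.
TRAIN = [
    ([1, 2, 1], 0), ([2, 1, 3], 0), ([3, 3, 2], 0),
    ([1, 3, 1], 0), ([2, 2, 2], 0), ([3, 1, 3], 0),
    ([7, 8, 9], 1), ([8, 7, 7], 1), ([9, 9, 8], 1),
    ([7, 9, 7], 1), ([8, 8, 9], 1), ([9, 7, 8], 1),
]
TEST = [([2, 3, 1], 0), ([1, 1, 2], 0), ([8, 9, 7], 1), ([7, 7, 9], 1)]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def fit_stump(rows, feature):
    """Fit a one-split tree that predicts 1 when x[feature] > threshold."""
    values = sorted({x[feature] for x, _ in rows})
    # Candidates: "all malignant", every midpoint between consecutive
    # distinct values, and "all benign" (for the rows in this sample).
    midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates = [values[0] - 1] + midpoints + [values[-1]]
    labels = [y for _, y in rows]

    def score(threshold):
        return accuracy([int(x[feature] > threshold) for x, _ in rows], labels)

    return feature, max(candidates, key=score)


def fit_forest(rows, n_trees=15, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)        # random feature per tree
        forest.append(fit_stump(sample, feature))
    return forest


def predict(forest, x):
    """Majority vote over the stumps (n_trees is odd, so no ties)."""
    votes = Counter(int(x[feature] > threshold) for feature, threshold in forest)
    return votes.most_common(1)[0][0]


test_labels = [y for _, y in TEST]

# Random-forest-style model, evaluated on the held-out rows.
forest = fit_forest(TRAIN)
forest_acc = accuracy([predict(forest, x) for x, _ in TEST], test_labels)

# Baseline: always predict the most frequent training label.
majority_label = Counter(y for _, y in TRAIN).most_common(1)[0][0]
baseline_acc = accuracy([majority_label] * len(TEST), test_labels)

print(f"forest accuracy:   {forest_acc:.2f}")
print(f"baseline accuracy: {baseline_acc:.2f}")
```

The toy classes are cleanly separable by construction, so the gap between the two scores here says nothing about behaviour on real clinical data. In a Spark setting the same steps (feature assembly, model fitting, evaluation) would be expressed as pipeline stages rather than plain functions.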