Preserve Quality Medical Drug Data Toward Meaningful Data Lake by Cluster
Areen Al-Hgaish1, Wael Alzyadat2, Mohammad Al-Fayoumi3, Aysh Alhroob4, Ahmad Thunibat5

1Areen Al-Hgaish, Department of Software Engineering, Faculty of Information Technology, Isra University, Amman, Jordan.
2Wael Alzyadat, Department of Software Engineering, Al-Zaytoonah University of Jordan, Amman, Jordan.
3Mohammad Al-Fayoumi, Department of Software Engineering, Faculty of Information Technology, Isra University, Amman, Jordan.
4Aysh Alhroob, Department of Software Engineering, Faculty of Information Technology, Isra University, Amman, Jordan.
5Ahmad Thunibat, Department of Software Engineering, Al-Zaytoonah University of Jordan, Amman, Jordan.

Manuscript received on 15 August 2019. | Revised Manuscript received on 25 August 2019. | Manuscript published on 30 September 2019. | PP: 270-277 | Volume-8 Issue-3 September 2019 | Retrieval Number: C4129098319/19©BEIESP | DOI: 10.35940/ijrte.C4129.098319
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Big data poses challenges that stem from its defining characteristics: velocity, volume, value, and veracity. Processing and analyzing big data to obtain quality information that supports accurate medical drug practice is therefore difficult. The data quality taxonomy is indicated by three basic elements: meaningfulness, prediction, and decision-making. These elements have been emphasized in previous work on the same big data challenges. Accordingly, the proposed approach preserves the quality of medical drug data toward a meaningful data lake by clustering. It consists of four components. The first component is data collection and pre-processing in the data lake, where profile data is treated as semi-structured data and cleaned. The second component extracts data by enforcing rules on the whole data set to produce different groups, and generates weights based on the constraints within those groups. In the third component, data is organized and clustered; this component complies with schema profiling and refers back to component two in the data lake. The weight outputs of component three are the inputs to component four, where K-Mean clustering is applied to obtain different clusters. Each cluster presents an alternative drug, yielding meaningful drug data that is consistent with component three in the data lake. This paper addresses two main challenges: the first is extracting meaningful data from big data, and the second is using a big data technique with the K-Mean clustering algorithm. An experimental approach was followed using Food and Drug Administration (FDA) data and symptoms in the R framework. An ANOVA statistical test was carried out to calculate the sum of squared errors, P-value, and F-value in order to evaluate the variance between clusters and the variance within clusters. The results showed the efficiency of the proposed approach.
Keywords: Data Lake, K-Mean Clustering, Big Data, Semi-Structured Data.
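
The final step summarized in the abstract (clustering weights with K-Mean, then comparing between-cluster and within-cluster variance with ANOVA) can be illustrated with a minimal sketch. The authors worked in R; Python is used here only for illustration. The weight values, the seeding rule, and the function names kmeans_1d and anova_f are assumptions made for this example and do not come from the paper or from FDA data.

```python
# Illustrative sketch only: the weights below are made-up values, not FDA data.
from statistics import mean


def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd-style k-means on scalar weights with deterministic seeding."""
    ordered = sorted(values)
    # Assumption: seed centroids at evenly spaced positions of the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: (v - centroids[c]) ** 2) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def anova_f(values, labels, k):
    """One-way ANOVA: between/within sums of squares and the F value."""
    grand = mean(values)
    groups = [[v for v, lab in zip(values, labels) if lab == c] for c in range(k)]
    groups = [g for g in groups if g]  # ignore empty clusters
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_value


# Hypothetical weights standing in for the output of the weighting component.
weights = [0.10, 0.12, 0.15, 0.48, 0.50, 0.55, 0.90, 0.92, 0.95]
labels, centroids = kmeans_1d(weights, k=3)
ss_between, ss_within, f_value = anova_f(weights, labels, k=3)
print("labels:", labels)
print("centroids:", centroids)
print("SS between:", ss_between, "SS within:", ss_within, "F:", f_value)
```

A large F value indicates that the variance between clusters dominates the variance within clusters. The P-value would follow from the F distribution with (k - 1, n - k) degrees of freedom; it is omitted here to keep the sketch free of external dependencies.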
Scope of the Article: