Techniques for Data Extraction from Heterogeneous Sources with Data Security
Kimmi Kumari1, M Mrunalini2
1Kimmi Kumari, MCA, M S Ramaiah Institute of Technology, Bangalore, India.
2Dr. M Mrunalini, MCA, M S Ramaiah Institute of Technology, Bangalore, India.
Manuscript received on 05 March 2019 | Revised Manuscript received on 11 March 2019 | Manuscript published on 30 July 2019 | PP: 2152-2159 | Volume-8 Issue-2, July 2019 | Retrieval Number: B3254078219/19©BEIESP | DOI: 10.35940/ijrte.B3254.078219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Data extraction is the process of mining or fetching relevant information from unstructured data or from heterogeneous data sources. This paper mines data from three different kinds of source, namely an online website, flat files and a database, and analyzes the extracted data in terms of precision, recall and accuracy. Data extraction is a crucial issue in an environment of heterogeneous data sources, and such heterogeneity is becoming increasingly widespread. This paper therefore focuses on the different sources for data extraction and provides a single framework to perform the required tasks. Healthcare data are used to demonstrate the processing: data are extracted from the three sources, divided into two clusters based on a threshold value calculated using cosine similarity, and finally evaluated by calculating parameters such as precision, recall and accuracy. Fetching data online does not yield a simple string from a website. The backend of each page is HTML, and hence this paper focuses on extracting the HTML of the page when mining data from a web server. A webpage contains many HTML tags, and not all of them can be removed, because some are complex tags that regular expressions cannot remove. Nevertheless, 60% filtered data can still be attained, as demonstrated in this paper, since most of the unwanted HTML is removed. During filtration, it should also be noted that content containing Google APIs cannot be removed, so the filtered data will contain the contents and tags that do not contain Google APIs. To provide data security during extraction, a connection string is used to avoid tampering with the data. This paper also focuses on the Data Lake, one of the debatable concepts of the big data generation.
The idea of the Data Lake originates from the field of business. A Data Lake is an architectural approach specially designed to store all potentially relevant data in a centrally located repository. The data stored in this central repository are fetched from public as well as enterprise sources and are further used for purposes such as organization, discovery of hidden facts, understanding of new concepts and analysis of stored information. Because the Data Lake is a new and revolutionary concept, many challenges and privacy-related concerns are faced during its adoption. This paper also highlights some of the issues posed by the Data Lake.
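The clustering and evaluation steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how records are vectorized or how the threshold is derived, so the bag-of-words vectors, the fixed threshold of 0.5, the sample records and all function names below are assumptions made purely for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words term-frequency vectors (assumed representation)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def split_by_threshold(records, query, threshold):
    """Divide records into two clusters: similarity >= threshold, and the rest."""
    relevant, other = [], []
    for record in records:
        target = relevant if cosine_similarity(record, query) >= threshold else other
        target.append(record)
    return relevant, other


def evaluate(predicted, actual, total):
    """Precision, recall and accuracy of a predicted relevant set against labelled data."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # correctly selected records
    fp = len(predicted - actual)   # selected but not actually relevant
    fn = len(actual - predicted)   # relevant but missed
    tn = total - tp - fp - fn      # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical healthcare-style records, query and hand-assigned labels.
records = [
    "patient diabetes insulin treatment",
    "diabetes blood sugar insulin",
    "hospital parking fees",
    "insulin dosage guidelines",
]
query = "diabetes insulin"
labelled_relevant = [records[0], records[1], records[3]]

relevant, other = split_by_threshold(records, query, threshold=0.5)
precision, recall, accuracy = evaluate(relevant, labelled_relevant, len(records))
print(relevant, other)
print(precision, recall, accuracy)
```

In this toy data the fourth record shares only one term with the query and falls below the assumed threshold, so it is counted as a false negative: recall drops while precision is unaffected. A lower threshold would trade in the opposite direction.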
Keywords: Accuracy, Data Extraction, Data Lake, Data Security.
Scope of the Article: Heterogeneous and Streaming Data