Implementation of ETL Process using Pig and Hadoop
Anushree Raj1, Rio D’Souza2
1Anushree Raj*, Research Scholar, CSE Department, St Joseph Engineering College, Mangalore, India.
2Rio D’Souza, Department of CSE, St Joseph Engineering College, Mangalore, India.
Manuscript received on January 05, 2020. | Revised Manuscript received on January 25, 2020. | Manuscript published on January 30, 2020. | PP: 4896-4899 | Volume-8 Issue-5, January 2020. | Retrieval Number: E4901018520/2020©BEIESP | DOI: 10.35940/ijrte.E4901.018520
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: ETL stands for extraction, transformation and loading. Extraction acquires data from the source; transformation involves data cleansing, data filtering, data validation and, finally, the application of certain rules; and loading stores the data in the destination repository where it finally resides. Pig is one of the most important tools that can be applied in the Extract, Transform and Load (ETL) process, and it helps in applying the ETL approach to large data sets. Pig first loads the data and is then able to perform projections, iterations, expected conversions and further transformations. User-defined functions (UDFs) can be used to perform more complex algorithms during the transformation phase. The large volume of data processed by Pig can then be stored back in HDFS. In this paper we demonstrate the ETL process using Pig in Hadoop: we show how files in HDFS are extracted, transformed and loaded back to HDFS using Pig, and we extend the functionality of Pig Latin with Python UDFs to perform the transformations.
Keywords: ETL Process, Extract, Load, HDFS ETL, Pig Latin, Python UDFs, Transform.
Scope of the Article: Process & Device Technologies.
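The transformation phase described in the abstract, where Python UDFs extend Pig Latin, could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: the function names `clean_name` and `is_valid_age`, the field names, and the age range are hypothetical. The sketch assumes Pig supplies the `outputSchema` decorator when the file is registered as a Jython UDF; the stand-in decorator exists only so the functions can be run outside Pig.

```python
# Hypothetical Python UDFs for the transform step of a Pig ETL job.
# Assumption: Pig provides `outputSchema` when this file is registered as a
# Jython UDF. The stand-in below is used only when running outside Pig.
if "outputSchema" not in globals():
    def outputSchema(schema):
        """Stand-in decorator: records the schema string on the function."""
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("name:chararray")
def clean_name(raw):
    """Data cleansing: collapse whitespace and normalise capitalisation."""
    if raw is None:
        return None  # Pig passes nulls as None; keep them as nulls
    return " ".join(raw.split()).title()


@outputSchema("valid:int")
def is_valid_age(age):
    """Data validation: 1 if age parses as an integer in 0..120, else 0."""
    try:
        return 1 if 0 <= int(age) <= 120 else 0
    except (TypeError, ValueError):
        return 0  # missing or non-numeric values are treated as invalid
```

In a Pig script, such a file (here assumed to be saved as `udfs.py`) would typically be registered with `REGISTER 'udfs.py' USING jython AS udfs;`. The functions could then be called inside a `FOREACH ... GENERATE` statement between the `LOAD` from HDFS and the `STORE` back to HDFS.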