CSFC: A New Centroid Based Clustering Method to Improve the Efficiency of Storing and Accessing Small Files in Hadoop
R. Rathi Devi1, R. Parameswari2

1R. Rathi Devi, Research Scholar, Department of Computer Science, Vels Institute of Science, Technology & Advanced Studies, Chennai (Tamil Nadu), India.
2Dr. R. Parameswari, Associate Professor, Department of Computer Science, Vels Institute of Science, Technology & Advanced Studies, Chennai (Tamil Nadu), India.
Manuscript received on 19 January 2020 | Revised Manuscript received on 02 February 2020 | Manuscript Published on 05 February 2020 | PP: 122-127 | Volume-8 Issue-4S5 December 2019 | Retrieval Number: D10141284S519/2019©BEIESP | DOI: 10.35940/ijrte.D1014.1284S519
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Computers play a major role in day-to-day life, and with this advancement of technology the volume of data collected from various fields keeps growing. Enormous amounts of data are produced every second and are not easy to process; such data is called Big Data. A large collection of small files is also considered Big Data, and small files are not easy to store and process in Hadoop. Existing methods use merging and clustering techniques to combine small files into larger files of up to 128 MB before sending them to HDFS. The proposed CSFC (Clustering Small Files based on Centroid) technique clusters files without specifying the number of clusters in advance, since a predefined cluster count forces all files into a limited set of clusters. Instead, clusters are generated according to the number of related files in the dataset: relevant files are combined into a cluster of up to 128 MB, and if a file is not relevant to any existing cluster, or the cluster has reached 128 MB, a new cluster is generated and the file is stored there. Since related files are easier to process together than unrelated ones, fetching data from the DataNode with this method produces more efficient results than other clustering techniques.
Keywords: DataNode, Hadoop Distributed File System, Hadoop, NameNode.
Scope of the Article: Clustering
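The abstract describes the CSFC loop only at a high level, so the following is a minimal Python sketch of one plausible reading: each small file carries an illustrative feature vector, relevance to a cluster is measured as cosine similarity to the cluster centroid, and a new cluster is opened whenever no existing cluster is both relevant and under the 128 MB HDFS block limit. The feature vectors, the cosine measure, and the SIM_THRESHOLD cutoff are assumptions for illustration, not details taken from the paper.

import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
SIM_THRESHOLD = 0.8             # hypothetical relevance cutoff (assumption)

def cosine(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class Cluster:
    def __init__(self, first_file):
        self.files = [first_file]
        self.size = first_file["size"]
        self.centroid = list(first_file["vector"])

    def can_accept(self, f):
        # Accept a file only if it fits under the 128 MB cap
        # and is relevant to this cluster's centroid.
        return (self.size + f["size"] <= BLOCK_SIZE
                and cosine(self.centroid, f["vector"]) >= SIM_THRESHOLD)

    def add(self, f):
        # Incrementally update the centroid as the mean of member vectors.
        n = len(self.files)
        self.centroid = [(c * n + v) / (n + 1)
                         for c, v in zip(self.centroid, f["vector"])]
        self.files.append(f)
        self.size += f["size"]

def csfc(files):
    # Single greedy pass: assign each file to the first cluster that is
    # both relevant and has room; otherwise open a new cluster.
    clusters = []
    for f in files:
        target = next((c for c in clusters if c.can_accept(f)), None)
        if target is None:
            clusters.append(Cluster(f))
        else:
            target.add(f)
    return clusters

# Toy usage: "size" in bytes, "vector" a tiny illustrative feature.
files = [
    {"name": "log_a.txt", "size": 40 * 1024 * 1024, "vector": [1.0, 0.1]},
    {"name": "log_b.txt", "size": 50 * 1024 * 1024, "vector": [0.9, 0.2]},
    {"name": "img_meta.txt", "size": 10 * 1024 * 1024, "vector": [0.1, 1.0]},
]
for i, c in enumerate(csfc(files)):
    print(i, [f["name"] for f in c.files], c.size)

In this sketch the two similar log files merge into one sub-128 MB cluster while the dissimilar file opens a second cluster; the number of clusters emerges from the data rather than being fixed in advance, which matches the paper's stated motivation for not specifying the cluster count beforehand.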