Similarity Detection in Large Volume Data using Machine Learning Techniques
Viji Gopal1, Varghese Paul2, M Sudheep Elayidom3, Sasi Gopalan4

1Viji Gopal, Department of Information Technology, School of Engineering, Cochin University of Science and Technology, Cochin, Kerala, India.
2Dr. Varghese Paul, Professor, Department of Information Technology, Rajagiri School of Engineering and Technology, Kakkanad, Kochi, Kerala, India.
3Dr. M Sudheep Elayidom, Professor, Department of Computer Science, School of Engineering, Cochin University of Science and Technology, Cochin, Kerala, India.
4Dr. Sasi Gopalan, Associate Professor, Division of Applied Sciences and Humanities, School of Engineering, Cochin University of Science and Technology, Cochin, Kerala, India.

Manuscript received on 15 August 2019. | Revised Manuscript received on 25 August 2019. | Manuscript published on 30 September 2019. | PP: 735-739 | Volume-8 Issue-3 September 2019 | Retrieval Number: C3987098319/19©BEIESP | DOI: 10.35940/ijrte.C3987.098319
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license.

Abstract: Plagiarism is the unauthorized copying or stealing of the intellectual property of others. Two main approaches are used to counter this problem: external plagiarism detection and intrinsic plagiarism detection. External algorithms compare a suspicious file with numerous sources, whereas intrinsic algorithms inspect only the suspicious file itself in order to predict plagiarism. This work addresses plagiarism detection in programs, that is, in source code files. In the case of source code, plagiarism means copying the entire source code, or the logic used in a particular program, without permission or copyright authorization. Many ways exist to detect plagiarism in source code files. However, plagiarism checking on a large dataset is computationally expensive and time consuming. To achieve computationally efficient similarity detection in source code files, the Hadoop framework is used, since it allows parallel computation over large datasets. The raw data, however, is not in a form suitable for existing plagiarism checking tools, because it is too large and has the characteristics of big data. A qualifying model is therefore required so that the dataset can be fed into Hadoop and processed efficiently to check for plagiarism in source code. Machine learning is used to generate such a model, thereby combining big data processing with machine learning.
Index Terms: Plagiarism, Big Data, Similarity, Hadoop, Machine Learning

Scope of the Article: Machine Learning
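
As an illustration of the external detection approach described in the abstract, in which a suspicious file is compared against a set of known sources, the following is a minimal sketch. It is not the authors' method: the tokenizer, the use of token 3-grams (shingles), the Jaccard similarity measure, and all function names are assumptions chosen for illustration only.

```python
import re


def tokenize(source):
    """Split source code into identifiers, numbers, and single symbols.

    Whitespace is discarded, so formatting-only changes do not affect the result.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def shingles(tokens, k=3):
    """Return the set of all runs of k consecutive tokens."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def external_check(suspicious, sources, k=3):
    """Score one suspicious file against every known source file.

    `sources` maps a (hypothetical) file name to its source text.
    Returns a dict mapping each file name to a similarity in [0, 1].
    """
    target = shingles(tokenize(suspicious), k)
    return {
        name: jaccard(target, shingles(tokenize(text), k))
        for name, text in sources.items()
    }


if __name__ == "__main__":
    known = {
        "a.c": "int sum(int a, int b) { return a + b; }",
        "b.c": 'void hello() { puts("hi"); }',
    }
    # Same code as a.c with the whitespace removed.
    suspect = "int sum(int a,int b){return a+b;}"
    for name, score in external_check(suspect, known).items():
        print(name, round(score, 2))
```

Each comparison between the suspicious file and one source is independent of the others, which is the property that makes this kind of workload suitable for parallel execution on a framework such as Hadoop.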