Stratification of Spam and Ham Short Message Service using Machine Learning Hash Vectorization
M. Shyamala Devi1, M. Nizar Ahamed2, S. Aarif Ahamed3

1M. Shyamala Devi, Professor, Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, India.
2M. Nizar Ahamed, Assistant Professor, Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, India.
3S. Arif Ahamed, Assistant Professor, Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, India.
Manuscript received on January 02, 2020. | Revised Manuscript received on January 15, 2020. | Manuscript published on January 30, 2020. | PP: 472-476 | Volume-8 Issue-5, January 2020. | Retrieval Number: E4911018520/2020©BEIESP | DOI: 10.35940/ijrte.E4911.018520

Open Access | Ethics and Policies | Cite | Mendeley
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: With the increase in the usage of mobile technology, the rate of information is duplicated as a huge volume. Due to the volume duplication of message, the identification of spam messages leads to challenging task. The growth of mobile usage leads to instant communication only through messages. This drastically leads to hackers and unauthorized users to the spread and misuse of sending spam messages. The identification of spam messages is a research oriented problem for the mobile service providers in order to raise the number of customers and to retain them. With this overview, this paper focuses on identifying and prediction of spam and ham messages. The SMS Spam Message Detection dataset from KAGGLE machine learning Repository is used for prediction analysis. The identification of spam and ham messages is done in the following ways. Firstly, the levels of spread of target variable namely spam or ham is identified and they are depicted as a graph. Secondly, the essential tokens that are responsible for the spam and ham messages are identified and they are found by using the hashing Vectorizer and it is portrayed in the form of spam and Ham messages word cloud. Thirdly, the hash vectorized SMS Spam Message detection dataset is fitted to various classifiers like Ada Boost Classifier, Extra Tree classifier, KNN classifier, Random Forest classifier, Linear SVM classifier, Kernel SVM classifier, Logistic Regression classifier, Gaussian Naive Bayes classifier, Decision Tree classifier, Gradient Boosting classifier and Multinomial Naive Bayes classifier. The evaluation of the classifier models are done by analyzing the Performance analysis metrics like Accuracy, Recall, FScore, Precision and Recall. The implementation is done by python in Anaconda Spyder Navigator. Experimental Results shows that the Linear Support Vector Machine classifier have achieved the effective performance indicators with the precision of 0.98, recall of 0.98, FScore of 0.98 , and Accuracy of 98.71%.
Keywords: Machine Learning, Hashing, Vectorizer, Spam, Ham and Classifier
Scope of the Article: Machine Learning.