Feature Selection on Noisy Twitter Short Text Messages for Language Identification
Mohd Zeeshan Ansari1, Tanvir Ahmad2, Ana Fatima3
1Mohd Zeeshan Ansari, Computer Engineering, Jamia Millia Islamia, New Delhi, India.
2Tanvir Ahmad, Computer Engineering, Jamia Millia Islamia, New Delhi, India.
3Ana Fatima, Computer Engineering, Jamia Millia Islamia, New Delhi, India. 

Manuscript received on November 11, 2019. | Revised Manuscript received on November 20, 2019. | Manuscript published on 30 November, 2019. | PP: 10505-10510 | Volume-8 Issue-4, November 2019. | Retrieval Number: D4360118419/2019©BEIESP | DOI: 10.35940/ijrte.D4360.118419

© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Written language identification typically involves detecting the languages present in a sample of text. A sequence of text need not belong to a single language; it may be a mixture of text written in multiple languages. Text of this kind is generated in large volumes on social media platforms, owing to their flexible and user-friendly environment. Such text contains a very large number of features, which are essential for developing statistical, probabilistic, and other kinds of language models. This large feature set includes informative features as well as irrelevant and redundant ones, which affect the performance of the learning model in different ways. Feature selection methods are therefore important for choosing the features most relevant to an efficient model. In this article, we consider the Hindi-English language identification task, as Hindi and English are two of the most widely spoken languages in India. We apply different feature selection algorithms across various learning algorithms to analyze the effect of both the algorithm and the number of selected features on task performance. The methodology focuses on word-level language identification using a novel dataset of 6903 tweets extracted from Twitter. Various n-gram profiles are examined with different feature selection algorithms over several classifiers. Finally, an exhaustive comparative analysis of all the experiments conducted for the task is presented.
Keywords: Code Mixing, Feature Selection, Language Identification, Twitter Data Analysis.
Scope of the Article: Natural Language Processing and Machine Translation.
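
The approach summarized in the abstract (word-level n-gram profiles followed by feature selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not name the feature selection algorithms used, so the chi-square statistic appears here only as one common choice, and the function names, boundary markers, and toy word list are hypothetical.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Return the set of character n-grams of a word.

    The word is lowercased and padded with '^' and '$' (an assumed
    convention) so that prefixes and suffixes become distinct features.
    """
    padded = f"^{word.lower()}$"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def chi2_scores(samples):
    """Score each n-gram by the 2x2 chi-square statistic.

    samples: list of (word, label) pairs with exactly two distinct labels.
    Each feature is binary: the n-gram is present in the word or not.
    """
    labels = sorted({label for _, label in samples})
    if len(labels) != 2:
        raise ValueError("exactly two labels are required")
    first = labels[0]
    n = len(samples)
    n_first = sum(1 for _, label in samples if label == first)
    n_second = n - n_first

    in_first, in_second = Counter(), Counter()
    for word, label in samples:
        (in_first if label == first else in_second).update(char_ngrams(word))

    scores = {}
    for gram in set(in_first) | set(in_second):
        a = in_first[gram]    # present, first label
        b = in_second[gram]   # present, second label
        c = n_first - a       # absent, first label
        d = n_second - b      # absent, second label
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # An n-gram present in every word (or in none) carries no signal.
        scores[gram] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring n-grams, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:k]]


# Toy, hand-made word list (hypothetical; not taken from the paper's dataset).
toy_samples = [("ka", "hi"), ("ki", "hi"), ("to", "en"), ("of", "en")]
toy_scores = chi2_scores(toy_samples)
toy_selected = select_top_k(toy_scores, 3)
```

In a full experiment, the selected n-grams would define the feature vector passed to each classifier, and the value of k would be varied to study how the number of selected features affects performance.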