A Statistical Model for Automatic Error Detection and Correction of Assamese Words
M P Bhuyan¹, S K Sarma²
¹M P Bhuyan, Department of Information Technology, Gauhati University, Guwahati, India.
²S K Sarma, Department of Information Technology, Gauhati University, Guwahati, India.
Manuscript received on 10 March 2019 | Revised Manuscript received on 18 March 2019 | Manuscript published on 30 July 2019 | PP: 6111-6116 | Volume-8 Issue-2, July 2019 | Retrieval Number: B3859078219/19©BEIESP | DOI: 10.35940/ijrte.B3859.078219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Digitization of local languages is gaining importance, and language processing is attracting growing interest among linguists and IT professionals. Most people are most comfortable writing in their mother tongue, so writing correct word forms on digital platforms matters for the long-term survival of a language. This work takes Assamese, an Indian language whose computational resources are still under development, as the natural language under study. Assamese has several characters that are phonetically identical but have different glyphs, and these characters often confuse users while writing; they receive special attention in this work. A list of 14 pairs of confusable Assamese characters is used in the experiments. The work focuses on errors in Assamese words, which are checked using bigram and trigram models. The proposed model also tries to locate the erroneous character that makes a word incorrect and offers suggestions for that character. A score-based system is designed in which each character is assigned a score derived from its probability of occurrence under the bigram and trigram language models. Several experiments are performed, and the proposed model checks the correctness of Assamese words with accuracy ranging from 81% to 86%. Using this model on any digital platform where users type in Assamese can reduce the error rate.
Index Terms: Assamese Language, Assamese Word, Bigram, Probability, Score, and Trigram.
Scope of the Article: Natural Language Processing
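The score-based check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `CharNgramModel`, the helpers `weakest_position` and `suggest`, the add-one smoothing, the five-word toy corpus, and the short `CONFUSABLE` list are all assumptions made for illustration (the abstract does not list the 14 pairs used in the paper). The sketch scores each character by its smoothed n-gram log-probability, flags the lowest-scoring character as the likely error, and tries swapping confusable characters to find a higher-scoring word.

```python
import math
from collections import Counter

# Illustrative confusable pairs only (assumption): the paper uses 14 pairs,
# which are not listed in the abstract.
CONFUSABLE = [("শ", "ষ"), ("ষ", "স"), ("শ", "স"), ("ন", "ণ"), ("ি", "ী")]


class CharNgramModel:
    """Character n-gram model with add-one smoothing (n=2 bigram, n=3 trigram)."""

    def __init__(self, words, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of (context + next character)
        self.contexts = Counter()  # counts of each context
        self.vocab = set()
        for w in words:
            padded = "^" * (n - 1) + w + "$"  # start/end markers
            self.vocab.update(padded)
            for i in range(n - 1, len(padded)):
                ctx = padded[i - n + 1:i]
                self.ngrams[ctx + padded[i]] += 1
                self.contexts[ctx] += 1

    def char_scores(self, word):
        """Log-probability of each character, followed by the end marker."""
        n = self.n
        padded = "^" * (n - 1) + word + "$"
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        scores = []
        for i in range(n - 1, len(padded)):
            ctx = padded[i - n + 1:i]
            p = (self.ngrams[ctx + padded[i]] + 1) / (self.contexts[ctx] + v)
            scores.append(math.log(p))
        return scores

    def score(self, word):
        """Average per-character log-probability of the whole word."""
        s = self.char_scores(word)
        return sum(s) / len(s)


def weakest_position(model, word):
    """Index of the character with the lowest score (likely error site)."""
    scores = model.char_scores(word)[:len(word)]  # ignore the end marker
    return min(range(len(word)), key=lambda i: scores[i])


def suggest(model, word):
    """Best-scoring word reachable by swapping one confusable character."""
    swaps = {}
    for a, b in CONFUSABLE:
        swaps.setdefault(a, set()).add(b)
        swaps.setdefault(b, set()).add(a)
    best_word, best_score = word, model.score(word)
    for i, ch in enumerate(word):
        for alt in swaps.get(ch, ()):
            cand = word[:i] + alt + word[i + 1:]
            s = model.score(cand)
            if s > best_score:
                best_word, best_score = cand, s
    return best_word


# Toy corpus (assumption); a real system would train on a large word list.
TOY_CORPUS = ["ভাষা", "অসম", "শব্দ", "মানুহ", "অসমীয়া"]
model = CharNgramModel(TOY_CORPUS, n=3)
print(weakest_position(model, "ভাসা"), suggest(model, "ভাসা"))
```

In this sketch a word is treated as suspicious when some confusable substitution yields a higher score. A deployed checker would more likely compare scores against a threshold tuned on held-out data, which is beyond what the abstract specifies.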