Preprocessed Text Compression Method for Malayalam Text Files
Rincy T A¹, Rajesh R²
¹Ms. Rincy T A, Assistant Professor, Department of Computer Science, Prajyoti Niketan College, Pudukad, Thrissur, Kerala, India.
²Dr. Rajesh R, Associate Professor, CHRIST (Deemed to be University), Bengaluru, Karnataka, India.
Manuscript received on 01 March 2019 | Revised Manuscript received on 04 March 2019 | Manuscript published on 30 July 2019 | PP: 1011-1015 | Volume-8 Issue-2, July 2019 | Retrieval Number: B1806078219/19©BEIESP | DOI: 10.35940/ijrte.B1806.078219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The growing use of Unicode for text files increases both the storage space required for data and the time needed to transmit it, and with it the need for data compression. Conventional compressors fare poorly on UTF-8 text, where each character can span multiple bytes. Malayalam, one of the four major languages of the Dravidian family, is represented using Unicode characters. The contribution of this paper is a reversible transformation that maps the input to a smaller representation before a general-purpose compression method is applied. After this preprocessing, LZW achieves better compression on Malayalam text files, which may contain any characters, including ASCII characters. The method can be extended to files in any native language whose characters come mostly from a single script.
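To make the idea of a reversible preprocessing step concrete, the following is a minimal sketch of one possible mapping, not the mapping defined in the paper, which the abstract does not specify. It assumes that code points in the Unicode Malayalam block (U+0D00 to U+0D7F), which take three bytes each in UTF-8, are mapped to single bytes in the range 0x80 to 0xFF, that ASCII bytes are kept as they are, and that everything else is written behind an escape byte. The escape value 0x1B and the names `encode` and `decode` are illustrative choices made for this sketch only.

```python
MALAYALAM_START = 0x0D00  # first code point of the Unicode Malayalam block
MALAYALAM_END = 0x0D7F    # last code point of the Unicode Malayalam block
ESCAPE = 0x1B             # assumed escape byte; an arbitrary choice for this sketch


def encode(text: str) -> bytes:
    """Map text to a byte string with one byte per Malayalam or ASCII character."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if MALAYALAM_START <= cp <= MALAYALAM_END:
            # Malayalam block -> single byte in 0x80..0xFF
            out.append(0x80 + (cp - MALAYALAM_START))
        elif cp < 0x80 and cp != ESCAPE:
            # ASCII (other than the escape byte itself) is copied unchanged
            out.append(cp)
        else:
            # Any other code point: escape byte, then the code point in 3 bytes
            out.append(ESCAPE)
            out += cp.to_bytes(3, "big")
    return bytes(out)


def decode(data: bytes) -> str:
    """Invert encode(), recovering the original text exactly."""
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESCAPE:
            chars.append(chr(int.from_bytes(data[i + 1:i + 4], "big")))
            i += 4
        elif b >= 0x80:
            chars.append(chr(MALAYALAM_START + (b - 0x80)))
            i += 1
        else:
            chars.append(chr(b))
            i += 1
    return "".join(chars)


sample = "മലയാളം text 123"
packed = encode(sample)
assert decode(packed) == sample  # the transformation is reversible
print(len(sample.encode("utf-8")), "UTF-8 bytes ->", len(packed), "mapped bytes")
```

Under these assumptions, the mapped byte string is what would be handed to a general-purpose compressor such as LZW, and `decode` would be applied after decompression. The escape path costs four bytes per character, so a scheme like this only pays off when the text comes mostly from one script, which matches the restriction stated in the abstract.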
Index Terms: Data Compression, Unicode, LZW, UTF-8, Compression Ratio.
Scope of the Article: Data Mining Methods, Techniques, and Tools