An Apriori Method for Topic Extraction from Text Files
Anil Kumar K. M¹, Ajay B², Shashank R³, Amogha Subramanya D. A⁴
¹Anil Kumar K. M, Department of CS & E, JSS Science and Technology University, Mysore, India.
²Ajay B, Department of CS & E, JSS Science and Technology University, Mysore, India.
³Shashank R, Department of CS & E, JSS Science and Technology University, Mysore, India.
⁴Amogha Subramanya D. A, Department of CS & E, JSS Science and Technology University, Mysore, India.
Manuscript received on 13 March 2019 | Revised Manuscript received on 19 March 2019 | Manuscript published on 30 July 2019 | PP: 2516-2521 | Volume-8 Issue-2, July 2019 | Retrieval Number: A3068058119/19©BEIESP | DOI: 10.35940/ijrte.A3068.078219
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: In this data age, petabytes of data are generated every day. One of the biggest challenges today is to convert this data into useful information, a process known as data mining. Important kinds of data include text, audio, image, and video data. An important challenge in mining useful information from text-based data sources (text mining) is topic modeling, which seeks to identify the topic a text is about. Solutions to this problem find application in clustering files by topic, as a pre-processing step in information retrieval, in building ontologies of medical records, etc. A great deal of research has gone into topic modeling, and many approaches have been formulated. Some of these approaches consider the occurrence and frequency of words or terms; these models fall under the Bag of Words (BOW) approach. Others take into account the underlying structure of the text corpus used; the Wikipedia category graph is an example of this approach. This paper provides an unsupervised solution to the above problem by extracting keywords that represent the topic of a text document. In our approach, topic modeling is carried out with a hybrid model that makes use of WordNet and the Wikipedia corpus. Promising experimental results have been obtained from our model on a well-known news dataset (BBCNews). We present the experimental results of our proposed approach alongside those of other approaches in the same domain and show that our approach provides better results.
Index Terms: Bag of Phrases, Cosine Similarity, Key Phrases, Occurrence Matrix, Keyword Extraction.
Scope of the Article: Text Mining
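
For readers unfamiliar with the Bag of Words idea and the cosine similarity measure named in the abstract and index terms, the following is a minimal illustrative sketch in Python. It is not the authors' hybrid WordNet/Wikipedia model; the function names `bag_of_words` and `cosine_similarity`, the simple regex tokenizer, and the three sample sentences are assumptions made only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count term occurrences, ignoring case, word order, and punctuation."""
    # Assumption: a term is any run of ASCII letters; real systems
    # usually also remove stop words and apply stemming.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    shared_terms = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample sentences: two about sport, one about business.
sport_1 = bag_of_words("The match ended with a late goal")
sport_2 = bag_of_words("A late goal decided the match")
business = bag_of_words("Shares fell as the market closed")

print(cosine_similarity(sport_1, sport_2))
print(cosine_similarity(sport_1, business))
```

Under these assumptions, the two sport sentences share most of their terms and score higher than the sport/business pair, which shares only a common function word. This is the kind of frequency-based signal that BOW approaches rely on.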