Bipartite Graph Energy Based Similarity Measure for Document Clustering
G. Hannah Grace1, Kalyani Desikan2
1G. Hannah Grace, Division of Mathematics, School of Advanced Sciences, VIT University, Chennai (Tamil Nadu), India.
2Kalyani Desikan, Division of Mathematics, School of Advanced Sciences, VIT University, Chennai (Tamil Nadu), India.
Manuscript received on 13 March 2019 | Revised Manuscript received on 20 March 2019 | Manuscript published on 30 March 2019 | PP: 194-200 | Volume-7 Issue-6, March 2019 | Retrieval Number: F2183037619/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Document clustering is a text mining technique in which a document collection is divided into meaningful clusters using a suitable distance or similarity measure. The choice of distance measure therefore plays an important role in document clustering. Similar content is assigned to the same cluster, while dissimilar content is assigned to different clusters. This is achieved by minimizing the intra-cluster distance between documents and maximizing the distance between clusters. Distance measures used in document clustering include Euclidean distance, squared Euclidean distance, Minkowski distance, Chebyshev distance, power distance, percent disagreement, Manhattan distance, bit-vector distance, comparative-clustering distance, Huffman-code distance and dominance-based distance. In this paper we introduce a new similarity measure, Bipartite Graph Energy Based Similarity (BGEBS), for document clustering. BGEBS clusters documents by considering the energy of a bipartite graph representation of the document collection. We compare BGEBS with the Euclidean, Jaccard, Cosine, Canberra, Manhattan and Maximum distances, using k-means to form the clusters. We analyze the results on a synthetic data set containing six documents and also evaluate the measure on a few benchmark data sets such as CLASSIC, WEBKB and BBC. To validate our measure we use an internal quality measure, the sum of squares within (SSW). The SSW values obtained indicate that BGEBS performs well in comparison with the other distance measures.
Keywords: Bipartite Graph, Document clustering, Similarity measure, Distance measures.
Scope of the Article: Graph Algorithms and Graph Drawing
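The abstract relies on two quantities that are easy to illustrate: the energy of a bipartite graph and the SSW internal quality measure. The sketch below is not the BGEBS measure itself, whose definition is not given in the abstract. It assumes the standard definition of graph energy (the sum of the absolute values of the adjacency eigenvalues) and a hypothetical binary document-term matrix used as the biadjacency matrix of a document-term bipartite graph. The function names, the toy data and the cluster labels are all illustrative assumptions.

```python
import numpy as np

def bipartite_energy(B):
    """Energy of the bipartite graph with biadjacency matrix B.

    The adjacency matrix is [[0, B], [B^T, 0]]; its nonzero eigenvalues are
    plus and minus the singular values of B, so the energy is twice their sum.
    """
    B = np.asarray(B, dtype=float)
    return 2.0 * np.linalg.svd(B, compute_uv=False).sum()

def energy_from_adjacency(B):
    """Same quantity, computed directly from the full adjacency matrix."""
    B = np.asarray(B, dtype=float)
    m, n = B.shape
    A = np.block([[np.zeros((m, m)), B], [B.T, np.zeros((n, n))]])
    return np.abs(np.linalg.eigvalsh(A)).sum()

def ssw(X, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Hypothetical toy collection: 6 documents x 4 terms (1 = term occurs).
docs = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
])
labels = np.array([0, 0, 0, 1, 1, 1])  # assumed cluster assignment

if __name__ == "__main__":
    print("graph energy:", bipartite_energy(docs))
    print("SSW:", ssw(docs, labels))
```

A lower SSW indicates tighter clusters, which is how an internal measure of this kind can be used to compare clusterings produced under different distance or similarity measures.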