A Modified Hierarchical Agglomerative Approach for Efficient Document Clustering System

Authors

  • May Thu Lwin, Department of Computer Engineering and Information Technology, MTU, Mandalay, Myanmar
  • Moe Moe Aye, Department of Computer Engineering and Information Technology, MTU, Mandalay, Myanmar

Keywords

Document Clustering, Agglomerative Hierarchical Clustering (AHC) algorithm, Similarity Measures, F-measure, Optimized Bubble Sort Algorithm.

Abstract

In today's world, the growing volume of text documents makes it difficult to organize them effectively and efficiently. This has created a strong demand for tools that turn data into valuable knowledge. Document clustering is one technique that can play an important role in meeting this objective. Its main function is to group documents automatically so that the documents within a cluster are very similar to one another but dissimilar to the documents in other clusters. This research proposes a Modified Agglomerative Hierarchical Clustering (MAHC) algorithm based on the hierarchical method. Many traditional systems use term frequency to build the data representation matrix. The modified algorithm instead builds the data representation matrix based only on the occurrence of items, not on their frequency. The proposed algorithm can improve clustering quality because it efficiently merges related or similar documents into the same cluster. Moreover, it can reduce processing time compared with existing methods. In this paper, the clustering performance of the proposed algorithm and the original clustering algorithm is compared and evaluated using the F-measure.
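
The ideas summarized above can be illustrated with a minimal Python sketch. It is not the MAHC algorithm itself, whose details are given in the full paper. It only shows an occurrence-based (binary) document representation, a naive agglomerative merge loop, and an F-measure evaluation. The Jaccard similarity, the average-linkage rule, the toy documents, and all function names are assumptions made for illustration.

```python
from itertools import combinations


def occurrence_sets(docs):
    """Represent each document by the set of distinct terms it contains.

    This is the binary (presence/absence) row of a document-term matrix:
    how often a term repeats inside a document is deliberately ignored.
    """
    return [set(doc.lower().split()) for doc in docs]


def jaccard(a, b):
    """Jaccard similarity of two term sets (an assumed choice of measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_similarity(c1, c2, rows):
    """Average-linkage similarity between two clusters of document indices."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(jaccard(rows[i], rows[j]) for i, j in pairs) / len(pairs)


def agglomerate(rows, k):
    """Naive bottom-up clustering: merge the most similar pair until k remain."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        # combinations yields index pairs with a < b, so deleting b is safe.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]], rows),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def f_measure(clusters, labels):
    """Overall F-measure: best per-class F1 over clusters, weighted by class size."""
    n = len(labels)
    total = 0.0
    for cls in set(labels):
        members = {i for i, lab in enumerate(labels) if lab == cls}
        best = 0.0
        for cluster in clusters:
            hits = len(members & set(cluster))
            if hits:
                precision = hits / len(cluster)
                recall = hits / len(members)
                best = max(best, 2 * precision * recall / (precision + recall))
        total += len(members) / n * best
    return total


# Hypothetical toy corpus; the repeated word in the first document has no
# effect because only term occurrence is recorded.
docs = [
    "clustering groups similar documents documents documents",
    "hierarchical clustering groups documents",
    "bubble sort swaps adjacent items",
    "optimized bubble sort stops early",
]
labels = ["clustering", "clustering", "sorting", "sorting"]

rows = occurrence_sets(docs)
clusters = agglomerate(rows, k=2)
score = f_measure(clusters, labels)
```

In this toy run the two clustering documents share the most terms and are merged first, followed by the two sorting documents. A frequency-based representation would instead weight the repeated word in the first document, which is the difference the abstract highlights.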

Published

2017-03-29

How to Cite

Lwin, M. T., & Aye, M. M. (2017). A Modified Hierarchical Agglomerative Approach for Efficient Document Clustering System. American Scientific Research Journal for Engineering, Technology, and Sciences, 29(1), 228–238. Retrieved from https://asrjetsjournal.org/index.php/American_Scientific_Journal/article/view/2773
