Visualization and Clustering of Text Retrieval

  • Abdel Naser Pouamoun, International Computer Institute, Ege University, 35100, Izmir, Turkey
Keywords: data visualization, TREC collections, Principal Component Analysis, K-means algorithm, clustering

Abstract

In a rapidly changing world where nearly every object generates data, handling large data collections has become a major concern for data scientists. Two of the main challenges are representing such data effectively and communicating the information hidden in them to users. Many data analysis and data visualization techniques have been proposed in response, and the appropriate data processing steps depend on the nature of the data to be visualized and the type of information to be communicated. In this work, we analyze and visualize a sample of TREC-6 data from the TREC (Text Retrieval Conference) collections. The TREC document collections comprise full text from newspaper articles and US government records, and are primarily intended for researchers working on Information Retrieval (IR) systems and Natural Language Processing. First, the documents are parsed and their words extracted to build a corpus in the form of a matrix. Then, Principal Component Analysis is applied to the corpus matrix to reduce its dimensionality to 2. Finally, the unsupervised K-means algorithm is used to partition the data into clusters, which are visualized interactively using popular chart types: a Pie Chart, a Stacked Bar Chart and a Scatter Chart. The diversity of the information contained in TREC-6 can be observed through the most frequent words of each cluster, which appear on the Bar Chart when the corresponding cluster's slice of the Pie Chart is clicked.
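
The pipeline described above (corpus matrix, reduction to two principal components, K-means clustering, most frequent words per cluster) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the author's implementation. The function names (`term_matrix`, `pca_2d`, `kmeans`, `top_words`), the toy documents, the whitespace tokenization, the power-iteration approach to PCA, and the farthest-first centroid initialization are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def term_matrix(docs):
    """Build a document-by-term count matrix (whitespace tokenization is an assumption)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            matrix[i][column[word]] = float(count)
    return matrix, vocab


def pca_2d(matrix, iters=100, seed=0):
    """Project rows onto the top two principal components via power iteration."""
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in matrix]  # center columns
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            scores = [sum(x[j] * v[j] for j in range(d)) for x in X]            # X v
            v = [sum(X[i][j] * scores[i] for i in range(n)) for j in range(d)]  # X^T (X v)
            for c in components:  # deflation: stay orthogonal to earlier components
                dot = sum(a * b for a, b in zip(v, c))
                v = [a - dot * b for a, b in zip(v, c)]
            norm = math.sqrt(sum(a * a for a in v)) or 1.0
            v = [a / norm for a in v]
        components.append(v)
    return [[sum(x[j] * c[j] for j in range(d)) for c in components] for x in X]


def kmeans(points, k, iters=50):
    """Lloyd's K-means with deterministic farthest-first initialization (an assumption)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def top_words(matrix, vocab, labels, cluster, n=3):
    """Return the n most frequent words among the documents of one cluster."""
    totals = Counter()
    for row, label in zip(matrix, labels):
        if label == cluster:
            for j, count in enumerate(row):
                if count:
                    totals[vocab[j]] += count
    return [word for word, _ in totals.most_common(n)]


# Toy stand-in for the TREC-6 sample (hypothetical documents).
docs = [
    "stock market prices trading stock market",
    "stock market prices investors stock market",
    "stock market trading investors stock market",
    "senate congress bill vote senate congress",
    "senate congress bill law senate congress",
    "senate congress vote law senate congress",
]
matrix, vocab = term_matrix(docs)
points = pca_2d(matrix)        # one (x, y) pair per document, as for a scatter chart
labels = kmeans(points, k=2)   # cluster sizes would feed a pie chart
for cluster in sorted(set(labels)):
    print(cluster, labels.count(cluster), top_words(matrix, vocab, labels, cluster))
```

The two-dimensional points correspond to what a scatter chart would plot, the cluster sizes to the slices of a pie chart, and the per-cluster word counts to the bars shown for a selected cluster. On a real collection, a numerical library would be a more practical choice than pure-Python loops for the eigendecomposition.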

Published: 2021-02-14