Visualization and Clustering of Text Retrieval
Abstract
In a fast-changing world in which every object will be generating data, handling large data collections has become a major concern for data scientists. Among the main challenges they face are representing these data effectively and communicating the information hidden in them to users. Many data analysis and data visualization techniques have been proposed in response, and the appropriate data processing techniques depend on the nature of the data to visualize and the type of information to communicate. In this work, we analyze and visualize a data sample from TREC-6, one of the TREC (Text Retrieval Conference) collections. TREC document collections comprise full text from newspaper articles and US government records. They are primarily intended for researchers working on Information Retrieval (IR) systems and Natural Language Processing. First, documents are parsed and words are extracted to build a corpus in the form of a matrix. Then, Principal Component Analysis is applied to the corpus matrix to reduce its dimensionality to 2. Finally, the unsupervised K-means algorithm partitions the data into clusters, which are visualized interactively with popular chart types: Pie Chart, Stacked Bar Chart and Scatter Chart. The diversity of the information contained in TREC-6 can be observed through the most frequent words of each cluster, which appear on the Bar Chart when the corresponding cluster is clicked on the Pie Chart.
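The pipeline described in the abstract (word extraction into a matrix, PCA down to 2 dimensions, then K-means) can be illustrated with a minimal sketch. This is not the authors' implementation: the four toy documents, the function names (`build_term_matrix`, `pca_2d`, `kmeans`, `top_words`), the raw word-count weighting, the choice of two clusters, and the use of NumPy are all assumptions made for illustration only.

```python
import re
from collections import Counter

import numpy as np


def build_term_matrix(docs):
    """Build a document-by-word count matrix (raw counts are an assumption)."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = np.zeros((len(docs), len(vocab)))
    for i, toks in enumerate(tokenized):
        for word, count in Counter(toks).items():
            matrix[i, index[word]] = count
    return matrix, vocab


def pca_2d(matrix):
    """Project rows onto the first two principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; initial centers are k randomly chosen points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every center, then nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def top_words(matrix, vocab, labels, cluster, n=3):
    """Most frequent words among the documents assigned to one cluster."""
    totals = matrix[labels == cluster].sum(axis=0)
    order = np.argsort(-totals, kind="stable")[:n]
    return [vocab[j] for j in order if totals[j] > 0]


# Hypothetical toy documents standing in for the TREC-6 sample.
docs = [
    "stock market shares fall as market traders sell",
    "market prices rise and shares rally",
    "senate passes budget bill after senate debate",
    "budget bill vote delayed in senate",
]

matrix, vocab = build_term_matrix(docs)
coords = pca_2d(matrix)            # one 2-D point per document
labels, centers = kmeans(coords, k=2)

for cluster in range(2):
    print(cluster, top_words(matrix, vocab, labels, cluster))
```

The 2-D coordinates in `coords` are what a scatter chart would plot, the cluster sizes in `labels` would feed a pie chart, and `top_words` gives the per-cluster word frequencies that a bar chart would display.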
Copyright (c) 2021 American Scientific Research Journal for Engineering, Technology, and Sciences (ASRJETS)
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.