Short Text Document Clustering using Distributed Word Representation and Document Distance

Supavit KONGWUDHIKUNAKORN; Kitsana WAIYAMAI

doi:10.48048/wjst.2019.4133

Authors

Supavit KONGWUDHIKUNAKORN Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900
Kitsana WAIYAMAI Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900

DOI:

https://doi.org/10.48048/wjst.2019.4133

Keywords:

Distributed word representation, document distance, short text documents, short text documents clustering

Abstract

This paper presents a method for clustering short text documents, such as instant messages, SMS, or news headlines. Vocabularies in the texts are expanded using external knowledge sources and represented by a Distributed Word Representation. Clustering is done using the K-means algorithm with Word Mover's Distance as the distance metric. Experiments were done to compare the clustering quality of this method, and several leading methods, using large datasets from BBC headlines, SearchSnippets, StackExchange, and Twitter. For all datasets, the proposed algorithm produced document clusters with higher accuracy, precision, F1-score, and Adjusted Rand Index. We also observe that cluster description can be inferred from keywords represented in each cluster.

Downloads

Download data is not yet available.

References

S Liang, E Yilmaz and E Kanoulas. Dynamic clustering of streaming short documents. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA, 2016, p. 995-1004.

A Karandikar. 2010, Clustering short status messages: A topic model based approach, Master’s thesis. Faculty of the Graduate School, University of Maryland, Maryland, USA.

L Rokach and O Maimon. Clustering Methods. Springer US, Boston, MA, 2005, p. 321-52.

CC Aggarwal and C Zhai. A Survey of Text Clustering Algorithms. Springer US, Boston, MA, 2012, p. 77-128.

G Salton and MJ McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, USA, 1986.

KS Jones. A statistical interpretation of term specificity and its application in retrieval. J. Document. 1972; 28, 11-21.

J Ramos. Using tf-idf to determine word relevance in document queries. In: Proceedings of the 1st Informational Conference on Machine Learning, Piscataway, USA, 2003.

J Xu, P Wang, G Tian, B Xu, J Zhao, F Wang and H Hao. Short text clustering via convolutional neural networks. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Colorado, USA, 2015, p. 62-9.

S Deerwester, ST Dumais, GW Furnas, TK Landauer and R Harshman. Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 1990; 41, 391-407.

T Hofmann. Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, USA, 1999, p. 50-7.

DM Blei, AY Ng and MI Jordan. Latent dirichlet allocation. J. Mach. Learn. Res. 2003; 3, 993-1022.

VKR Sridhar. Unsupervised topic modeling for short texts using distributed representations of words. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, Colorado, USA, 2015, p. 192-200.

X Quan, C Kit, Y Ge and SJ Pan. Short and sparse text topic modeling via self-aggregation. In: Proceedings of the 24th International Conference on Artificial Intelligence, Buenos Aires, Argentina, 2015, p. 2270-6.

N Kalchbrenner, E Grefenstette and P Blunsom. A convolutional neural network for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, USA, 2014, p. 655-65.

J Xu, B Xu, P Wang, S Zheng, G Tian, J Zhao and B Xu. Self-taught convolutional neural networks for short text clustering. Neural Netw. 2017; 88, 22-31.

Y Yan, R Huang, C Ma, L Xu, Z Ding, R Wang, T Huang and B Liu. Improving document clustering for short texts by long documents via a dirichlet multinomial allocation model. In: Proceedings of the 1st International Joint Conference of Web and Big Data, Beijing, China, 2017, p. 626-41.

C MA, Q Zhao, J Pan and Y Yan. Short text classification based on distributional representations of words. IEICE Trans. Inform. Syst. 2016; 99, 2562-5.

L Hong and BD Davison. Empirical study of topic modeling in twitter. In: Proceedings of the 1st Workshop on Social Media Analytics, New York, USA, 2010, p. 80-8.

J Weng, EP Lim, J Jiang and Q He. Twitterrank: Finding topic-sensitive influential twitterers. In: Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, New York, USA, 2010, p. 261-70.

R Mehrotra, S Sanner, W Buntine and L Xie. Improving lda topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, USA, 2013, p. 889-92.

T Mikolov, I Sutskever, K Chen, G Corrado and J Dean. Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, Nevada, USA, 2013, p. 3111-9.

J MacQueen. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, USA, 1967, p. 281-97.

MJ Kusner, Y Sun, NI Kolkin and KQ Weinberger. From word embeddings to document distances. In: Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015, p. 957-66.

T Mikolov, K Chen, G Corrado and J Dean. Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations, Scottsdale, Arizona, USA, 2013.

DE Rumelhart, JL McClelland and C PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1, MIT Press, Cambridge, USA, 1986.

JL Elman. Distributed representations, simple recurrent networks, and grammatical structure. Mach. Learn. 1991; 7, 195-225.

Y Rubner, C Tomasi and LJ Guibas. A metric for distributions with applications to image databases. In: Proceedings of the Sixth International Conference on Computer Vision, Washington DC, USA, 1998.

Y Rubner, C Tomasi and LJ Guibas. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 2000; 40, 99-121.

G Salton and C Buckley. Term-weighting approaches in automatic text retrieval. Inform. Process. Manag. 1988; 24, 513-23.

A Huang. Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference, Christchurch, New Zealand, 2008, p. 49-56.

TS Madhulatha. An overview on clustering methods. Intell. Data Anal. 2007; 11, 583-605.

R Rehurek and P Sojka. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 2010, p. 45-50.

D Sailaja, M Kishore, B Jyothi and N Prasad. An overview of pre-processing text clustering methods. Int. J. Comput. Sci. Inform. Tech. 2015; 6, 3119-24.

AI Kadhim, YN Cheah and NH Ahamed. Text document preprocessing and dimension reduction techniques for text document clustering. In: Proceedings of the 4th International Conference on Artificial Intelligence and Applications in Engineering and Technology, Kota Kinabalu, Sabah, Malaysia, 2014, p. 69-73.

D Greene and P Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, USA, 2006, p. 377-84.

XH Phan, LM Nguyen and S Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web, New York, USA, 2008, p. 91-100.

S Vijayarani, MJ Ilamathi and M Nithya. Preprocessing techniques for text mining: An overview. Int. J. Comput. Sci. Comm. Netw. 2015; 5, 7-16.

M Speriosu, N Sudan, S Upadhyay and J Baldridge. Twitter polarity classification with label propagation over lexical links and the follower graph. In: Proceedings of the 1st Workshop on Unsupervised Learning in NLP, Stroudsburg, USA, 2011, p. 53-63.

CD Boom, SV Canneyt, T Demeester and B Dhoedt. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recogn. Lett. 2016; 80, 150-6.

S Wagner and D Wagner. Comparing Clusterings: An Overview. Fakultät für Informatik, Universität Karlsruhe, Germany, 2007, p. 1-19.

F Wilcoxon. Individual comparisons by ranking methods. Biometrics Bull. 1945; 1, 80-3.