Pattern-Sensitive Loanword Estimation for Thai Text Clustering

Authors

  • Burhan WANGLEM, Faculty of Technology and Environment, Prince of Songkla University, Phuket Campus, Phuket 83120
  • Nattapong TONGTEP, Faculty of Technology and Environment, Prince of Songkla University, Phuket Campus, Phuket 83120

Keywords:

Loanword detection, Pali word, Sanskrit word, Thai language, text clustering

Abstract

Writing style and language usage vary with the writer's purpose and affect readability. A good assessment of text readability helps readers find suitable texts with less effort. In Thai, text readability assessment is a challenging natural language processing task because Thai text is not segmented into words and has only ambiguous boundary markers for word and sentence segmentation. Furthermore, loanwords, that is, words borrowed from other languages such as Pali and Sanskrit, play an important role in text readability. In this paper, we propose a method to cluster Thai texts according to their readability by detecting loanwords and using them as features. First, loanwords in Thai are categorized into 7 types, each with a different pattern. The set of loanword patterns is then used to detect those patterns in a set of documents retrieved from a search engine. The experimental results show that detection of Thai words borrowed from both Pali and Sanskrit achieved the highest performance, with an F-measure of up to 100 % and an accuracy of 98.29 %.
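
The idea of pattern-based loanword detection and its evaluation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the seven pattern types, so the two regular expressions below are hypothetical stand-ins based on common textbook heuristics (characters such as ศ, ษ, ฤ, ฑ and the sequence รร are often associated with Sanskrit loans, and ฬ with Pali loans). These heuristics have exceptions among native Thai words. The input is assumed to be already segmented into words, and the names `PATTERNS`, `loanword_features`, and `precision_recall_f1` are introduced here for illustration only.

```python
import re

# Hypothetical, simplified patterns. These are NOT the seven pattern
# types defined in the paper; they are rough orthographic heuristics.
PATTERNS = {
    "sanskrit_like": re.compile("[ศษฤฑ]|รร"),
    "pali_like": re.compile("ฬ"),
}


def loanword_features(words):
    """Count how many pre-segmented words match each pattern."""
    counts = {name: 0 for name in PATTERNS}
    for word in words:
        for name, pattern in PATTERNS.items():
            if pattern.search(word):
                counts[name] += 1
    return counts


def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F-measure for two sets of words."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy input: four words assumed to come from a word segmenter.
features = loanword_features(["ศึกษา", "กีฬา", "บ้าน", "ธรรม"])

# Toy evaluation: three words flagged as loanwords, two of them correct.
scores = precision_recall_f1({"ศึกษา", "ธรรม", "บ้าน"}, {"ศึกษา", "ธรรม"})
```

The per-pattern counts in `features` could serve as a document's feature vector for a clustering step, while `scores` shows how an F-measure is derived from detected and reference loanword sets.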

Published

2017-06-22

How to Cite

WANGLEM, B., & TONGTEP, N. (2017). Pattern-Sensitive Loanword Estimation for Thai Text Clustering. Walailak Journal of Science and Technology (WJST), 14(10), 813–823. Retrieved from https://wjst.wu.ac.th/index.php/wjst/article/view/4166