Pattern-Sensitive Loanword Estimation for Thai Text Clustering

Burhan WANGLEM, Nattapong TONGTEP


Writing style and language usage vary with the writer's purpose and affect readability. A good assessment of text readability helps readers find suitable texts with less effort. In the Thai language, text readability assessment is a challenging task in natural language processing, because Thai text is not segmented into words and has only ambiguous boundary markers for word and sentence segmentation. Furthermore, loanwords, that is, words borrowed from other languages such as Pali and Sanskrit, play an important role in text readability. In this paper, we propose a method for clustering Thai texts according to their readability by detecting loanwords and using them as features. First, loanwords in Thai are categorized into seven types, each represented by a distinct pattern. The set of loanword patterns is then used to detect loanwords in a set of documents retrieved from a search engine. The experimental results show that the detection of Thai words borrowed from both Pali and Sanskrit achieved the highest performance, with an F-measure of up to 100 and an accuracy of 98.29%.
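The pattern-based detection and its evaluation can be illustrated with a minimal sketch. The abstract does not list the seven pattern types, so the three patterns below are hypothetical, character-level stand-ins based on common orthographic cues, not the patterns used in the paper. The function names (`detect_patterns`, `loanword_features`, `f_measure`) are likewise assumptions introduced for illustration, and the input is assumed to be already segmented into words.

```python
import re

# Hypothetical orthographic cues for illustration only;
# these are NOT the seven loanword pattern types defined in the paper.
ILLUSTRATIVE_PATTERNS = {
    "sanskrit_sibilant": re.compile("[ศษ]"),  # letters common in Sanskrit-derived words
    "ro_han": re.compile("รร"),               # doubled ร, as in ธรรม
    "retroflex_la": re.compile("ฬ"),          # letter often seen in Pali-derived words
}


def detect_patterns(token):
    """Return the names of all illustrative patterns found in one token."""
    return [name for name, rx in ILLUSTRATIVE_PATTERNS.items() if rx.search(token)]


def loanword_features(tokens):
    """Count pattern hits over pre-segmented tokens, divided by the token count."""
    counts = {name: 0 for name in ILLUSTRATIVE_PATTERNS}
    for tok in tokens:
        for name in detect_patterns(tok):
            counts[name] += 1
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return {name: c / n for name, c in counts.items()}


def f_measure(predicted, gold):
    """Balanced F-measure between a predicted and a gold set of loanwords."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tokens = ["ธรรม", "ศึกษา", "บ้าน", "กีฬา"]  # assumed output of a word segmenter
    print(loanword_features(tokens))
    predicted = {t for t in tokens if detect_patterns(t)}
    print(f_measure(predicted, {"ธรรม", "ศึกษา", "กีฬา"}))
```

The per-document feature vector returned by `loanword_features` is the kind of input a clustering step could consume; the choice of clustering algorithm is left open here because the abstract does not name one.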


Loanword detection, Pali word, Sanskrit word, Thai language, text clustering
