Journal of University of Science and Technology of China ›› 2019, Vol. 49 ›› Issue (7): 517-523. DOI: 10.3969/j.issn.0253-2778.2019.07.001

Unsupervised identification of Malay domain multiword expressions

WANG Lin   

  1. Xianda College of Economics and Humanities, Shanghai International Studies University, Shanghai 200083, China; 2. Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou 510420, China
  • Received: 2018-06-15 Revised: 2018-09-18 Online: 2019-07-31 Published: 2019-07-31

Abstract: A multiword expression (MWE) is an optimal unit of granularity for language reuse. However, the lack of explicit formal boundaries between MWEs and other words makes automatic MWE identification a serious problem for some less common languages. We address the identification of Malay domain MWEs and propose a natural-annotation-based unsupervised extraction and clustering algorithm. The algorithm first applies a binary classification to each space character to extract Malay MWEs of varying length, then transfers natural document-level category annotations to the MWE level to cluster the extracted MWEs, and finally filters out general MWEs to distill several domain-specific MWE datasets. Experimental results on a Malay dataset of 272 783 text documents show that the algorithm extracts MWEs precisely and dispatches them into domain lexicons efficiently.

Key words: unsupervised identification, multiword expression, domain lexicon, natural annotation, Malay
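
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say how each space is classified, so the sketch substitutes a simple pointwise mutual information (PMI) threshold on adjacent words. The function names, the thresholds `min_pmi`, `min_count` and `min_share`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def bigram_pmi(docs):
    """Compute PMI for every adjacent word pair in the tokenized documents."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi, bigrams


def extract_mwes(tokens, pmi, bigrams, min_pmi, min_count):
    """Step 1 (assumed classifier): label each space 'join' or 'split'.

    A space is 'join' when the pair around it is frequent and strongly
    associated; runs of joined words become MWEs of any length.
    """
    mwes, run = [], tokens[:1]
    for a, b in zip(tokens, tokens[1:]):
        pair = (a, b)
        join = bigrams[pair] >= min_count and pmi.get(pair, 0.0) >= min_pmi
        if join:
            run.append(b)
        else:
            if len(run) > 1:
                mwes.append(" ".join(run))
            run = [b]
    if len(run) > 1:
        mwes.append(" ".join(run))
    return mwes


def domain_lexicons(labeled_docs, min_pmi=1.0, min_count=2, min_share=0.8):
    """Steps 2 and 3: transfer document labels to MWEs, drop general MWEs."""
    docs = [text.lower().split() for _, text in labeled_docs]
    pmi, bigrams = bigram_pmi(docs)

    # Step 2: each MWE inherits the category of every document it occurs in.
    counts = defaultdict(Counter)  # mwe -> category -> occurrences
    for (category, _), tokens in zip(labeled_docs, docs):
        for mwe in extract_mwes(tokens, pmi, bigrams, min_pmi, min_count):
            counts[mwe][category] += 1

    # Step 3: keep an MWE only if one category dominates its occurrences;
    # MWEs spread across categories are treated as general and filtered out.
    lexicons = defaultdict(set)
    for mwe, by_category in counts.items():
        category, top = by_category.most_common(1)[0]
        if top / sum(by_category.values()) >= min_share:
            lexicons[category].add(mwe)
    return dict(lexicons)


# Hypothetical toy corpus: (document-level category, text).
TOY_DOCS = [
    ("sukan", "pasukan bola sepak menang tahun lalu"),
    ("sukan", "liga bola sepak bermula"),
    ("ekonomi", "laporan bank negara keluar tahun lalu"),
    ("ekonomi", "gabenor bank negara hadir"),
]
```

On the toy corpus, `domain_lexicons(TOY_DOCS)` assigns "bola sepak" to the sports category and "bank negara" to the economy category. "tahun lalu" occurs equally in both categories, so it is treated as a general MWE and dropped. In practice the thresholds would need tuning on real data.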