Journal of University of Science and Technology of China, 2019, Vol. 49, Issue 7: 517-523. DOI: 10.3969/j.issn.0253-2778.2019.07.001

• Original Paper •

马来语领域多词组无监督识别

王 琳   

  1.上海外国语大学贤达经济人文学院,上海 200083;2.广东外语外贸大学语言工程与计算实验室,广东广州 510420
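  • Received: 2018-06-15 Revised: 2018-09-18 Published: 2019-07-31 Released: 2019-07-31
  • Corresponding author: LIU Wuying (刘伍颖)
  • About the author: WANG Lin, female, born in 1983, Master's degree, lecturer. Research interests: computational linguistics and corpus linguistics. E-mail: lwang@xdsisu.edu.cn
  • Funding:
    Supported by the Shanghai Social Science Planning Project (2019BYY028), the Key Project of the State Language Commission of China (ZDI135-26), the Natural Science Foundation of Guangdong Province (2018A030313672), and the Key Project of the Guangzhou Key Research Base of Humanities and Social Sciences (2017-IC-02).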

Unsupervised identification of Malay domain multiword expressions

WANG Lin   

  1. Xianda College of Economics and Humanities, Shanghai International Studies University, Shanghai 200083, China; 2. Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou 510420, China
  • Received: 2018-06-15 Revised: 2018-09-18 Online: 2019-07-31 Published: 2019-07-31

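摘要 (Abstract): Multiword expressions are an optimized granularity of language reuse. In some less commonly used languages, however, multiword expressions lack explicit morphological boundaries separating them from ordinary words, which makes their automatic identification difficult. To identify Malay domain multiword expressions, this paper proposes an unsupervised extraction and clustering algorithm based on natural annotation. The algorithm first extracts variable-length Malay multiword expressions through binary classification of space characters; it then transfers document-level natural category annotations to category clustering at the multiword-expression level; finally, it filters out general multiword expressions and distills several domain multiword expression datasets. Experimental results on a dataset of 272 783 Malay text documents show that the proposed algorithm not only extracts multiword expressions precisely but also clusters them into domain lexicons efficiently.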

关键词 (Key words): unsupervised identification, multiword expression, domain lexicon, natural annotation, Malay

Abstract: A multiword expression (MWE) is an optimized granularity of language reuse. In some less commonly used languages, however, the lack of explicit formal boundaries between MWEs and ordinary words makes automatic MWE identification difficult. We address the identification of Malay domain MWEs and propose an unsupervised extraction and clustering algorithm based on natural annotation. The algorithm first applies a binary classification to each space character to extract variable-length Malay MWEs, then transfers the natural document-level category annotations to the MWE level to cluster the MWEs, and finally filters out general MWEs to distill several domain MWE datasets. Experimental results on a dataset of 272 783 Malay text documents show that the algorithm extracts MWEs precisely and clusters them into domain lexicons efficiently.

Key words: unsupervised identification, multiword expression, domain lexicon, natural annotation, Malay
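
The three steps named in the abstract (per-space binary classification, transfer of document-level categories to MWEs, and filtering of general MWEs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how each space is classified, so the corpus-frequency threshold in `frequent_bigram_classifier` is an assumption, as are all function names, the `max_domains` parameter, and the toy documents in `DOCS`.

```python
from collections import Counter, defaultdict

# Toy corpus of (category, text) pairs; categories stand in for natural document labels.
DOCS = [
    ("sukan", "pasukan bola sepak menang di kuala lumpur"),
    ("sukan", "liga bola sepak bermula"),
    ("ekonomi", "bank negara umum kadar baru dari kuala lumpur"),
    ("ekonomi", "bank negara kekal kadar"),
]


def frequent_bigram_classifier(docs, min_count=2):
    """Assumed stand-in classifier: a space is 'internal' if its word pair is frequent."""
    counts = Counter()
    for _, text in docs:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return lambda left, right: counts[(left, right)] >= min_count


def extract_mwes(tokens, is_internal):
    """Classify each space as internal (join) or boundary (split).

    Runs of joined words form one MWE, so MWE length is variable.
    """
    mwes, current = [], tokens[:1]
    for left, right in zip(tokens, tokens[1:]):
        if is_internal(left, right):
            current.append(right)
        else:
            if len(current) > 1:
                mwes.append(" ".join(current))
            current = [right]
    if len(current) > 1:
        mwes.append(" ".join(current))
    return mwes


def domain_lexicons(docs, is_internal, max_domains=1):
    """Transfer each document's category to its MWEs, then drop general MWEs."""
    seen_in = defaultdict(set)
    for category, text in docs:
        for mwe in extract_mwes(text.lower().split(), is_internal):
            seen_in[mwe].add(category)  # document label becomes an MWE label
    lexicons = defaultdict(set)
    for mwe, categories in seen_in.items():
        # Assumption: an MWE seen in more than max_domains categories is "general".
        if len(categories) <= max_domains:
            for category in categories:
                lexicons[category].add(mwe)
    return dict(lexicons)


if __name__ == "__main__":
    classifier = frequent_bigram_classifier(DOCS)
    print(domain_lexicons(DOCS, classifier))
```

On this toy corpus, "bola sepak" lands in the sports lexicon and "bank negara" in the economy lexicon, while "kuala lumpur" occurs under both categories and is filtered out as general. Passing the classifier in as a function keeps the extraction and clustering steps independent of whichever per-space decision rule is actually used.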