Journal of University of Science and Technology of China ›› 2017, Vol. 47 ›› Issue (4): 283-289. DOI: 10.3969/j.issn.0253-2778.2017.04.001

• Original Article •

Corpus selection for Uyghur-Chinese machine translation based on bilingual sentence coverage

ZHU Shaolin, YANG Yating, MI Chenggang, LI Xiao, WANG Lei

  1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China;
  2. Key Laboratory of Speech Language Information Processing of Xinjiang, Urumqi 830011, China;
  3. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2016-03-01 Revised: 2016-09-17 Online: 2017-04-30 Published: 2017-04-30
  • Corresponding author: YANG Yating
  • About the author: ZHU Shaolin, male, born in 1989, PhD candidate. Research interests: machine learning, information processing. E-mail: zhushaolin003@163.com
  • Funding:
    Supported by the National Natural Science Foundation of China (61473001, 71071045, 71131002).

Abstract: In corpus selection, the redundant information in a corpus includes redundancy at both the word level and the sentence level. Existing methods mainly select data according to word-level coverage. They effectively reduce the redundancy of words and phrases but do not take sentence-level coverage into account. To select a smaller training corpus from a large-scale bilingual corpus while obtaining a translation system that is as good as, or better than, one trained on the full data, parallel data are selected according to bilingual sentence pair coverage, and a selection method that combines unseen n-grams with edit distance is proposed. Experimental results show that the proposed method uses less training data yet achieves translation performance almost equivalent to that of the system trained on the original corpus.

Key words: statistical machine translation, bilingual sentence pairs, corpus selection
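
The abstract does not specify how unseen n-grams and edit distance are combined, so the following Python sketch shows only one plausible reading and is not the authors' algorithm. It assumes whitespace-tokenized sentences and a single greedy pass. A pair is kept if it contributes at least one n-gram not yet covered (word-level coverage) and is not a near-duplicate of an already selected pair under length-normalized token edit distance (sentence-level coverage). The function names and the parameters `max_n`, `min_new_ngrams`, and `min_norm_dist` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def ngrams(tokens: Sequence[str], max_n: int) -> Set[Tuple[str, ...]]:
    """All n-grams of order 1..max_n in a token sequence."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance, computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def norm_dist(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length (0.0 = identical)."""
    return edit_distance(a, b) / max(len(a), len(b), 1)


def select_pairs(pairs: List[Pair], max_n: int = 2,
                 min_new_ngrams: int = 1,
                 min_norm_dist: float = 0.3) -> List[Pair]:
    """Greedy one-pass selection (assumed scheme, see lead-in)."""
    seen_src: Set[Tuple[str, ...]] = set()
    seen_tgt: Set[Tuple[str, ...]] = set()
    kept: List[Pair] = []
    kept_tokens: List[Tuple[List[str], List[str]]] = []
    for src, tgt in pairs:
        s_tok, t_tok = src.split(), tgt.split()
        s_grams, t_grams = ngrams(s_tok, max_n), ngrams(t_tok, max_n)
        # Word-level test: the pair must add n-grams not covered so far.
        new = len(s_grams - seen_src) + len(t_grams - seen_tgt)
        if new < min_new_ngrams:
            continue
        # Sentence-level test: skip pairs whose source AND target sides
        # are both close to an already selected pair.
        if any(norm_dist(s_tok, ks) < min_norm_dist and
               norm_dist(t_tok, kt) < min_norm_dist
               for ks, kt in kept_tokens):
            continue
        kept.append((src, tgt))
        kept_tokens.append((s_tok, t_tok))
        seen_src |= s_grams
        seen_tgt |= t_grams
    return kept
```

The near-duplicate check compares each candidate with every selected pair, so this sketch is quadratic in the number of selected pairs and is meant for illustration on small data only. A large corpus would need an index or a blocking step, which the abstract does not describe.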
