Journal of University of Science and Technology of China ›› 2017, Vol. 47 ›› Issue (4): 283-289. DOI: 10.3969/j.issn.0253-2778.2017.04.001

• Original Paper •

Corpus selection for Uyghur-Chinese machine translation based on bilingual sentence coverage

ZHU Shaolin, YANG Yating, MI Chenggang, LI Xiao, WANG Lei

  1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China;
  2. Key Laboratory of Speech Language Information Processing of Xinjiang, Urumqi 830011, China;
  3. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2016-03-01 Revised: 2016-09-17 Online: 2017-04-30 Published: 2017-04-30

Abstract: When a training corpus is selected, redundancy arises not only at the vocabulary level but also at the sentence level. Existing selection methods focus mainly on vocabulary-level coverage. They effectively reduce the redundancy of words and phrases, but they do not take sentence-level coverage into account. To select, from a large-scale bilingual corpus, a smaller training corpus that yields a translation system as good as or better than one trained on the full data, the corpus is selected mainly according to sentence coverage, combining the unseen n-grams method with edit distance. The experimental results show that the proposed method uses a smaller training corpus yet achieves performance almost equivalent to that of the original training corpus.
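
The abstract names two ingredients, unseen n-grams and edit distance, without giving the procedure. The following is only a minimal sketch of one way the two could be combined, not the authors' algorithm. The function names (`ngrams`, `edit_distance`, `select_pairs`), the parameters `max_n` and `min_rel_dist`, whitespace tokenization, and the use of the source side alone are all illustrative assumptions.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], max_n: int = 3) -> Set[Tuple[str, ...]]:
    """Return all n-grams of order 1..max_n found in a token sequence."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def select_pairs(pairs: List[Tuple[str, str]],
                 max_n: int = 3,
                 min_rel_dist: float = 0.3) -> List[Tuple[str, str]]:
    """Greedy selection over (source, target) sentence pairs.

    Assumption: a pair is kept only if its source side (a) contributes at
    least one n-gram not yet covered and (b) is not a near-duplicate of an
    already selected source sentence, judged by length-normalized
    token-level edit distance. Both criteria are hypothetical choices.
    """
    seen: Set[Tuple[str, ...]] = set()
    selected: List[Tuple[str, str]] = []
    for src, tgt in pairs:
        src_tok = src.split()  # assumes pre-segmented, space-separated text
        grams = ngrams(src_tok, max_n)
        if not grams - seen:
            continue  # vocabulary-level redundancy: nothing new to cover
        near_duplicate = any(
            edit_distance(src_tok, s.split())
            / max(len(src_tok), len(s.split()), 1) < min_rel_dist
            for s, _ in selected
        )
        if near_duplicate:
            continue  # sentence-level redundancy
        selected.append((src, tgt))
        seen |= grams
    return selected
```

In this sketch the n-gram test removes vocabulary-level redundancy and the edit-distance test removes sentence-level redundancy; the threshold `min_rel_dist` would have to be tuned on held-out data.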

Key words: statistical machine translation, sentence pairs, corpus selection
