Corpus selection for Uyghur-Chinese machine translation based on bilingual sentence coverage

doi:10.3969/j.issn.0253-2778.2017.04.001

Journal of University of Science and Technology of China, 2017, Vol. 47, Issue 4: 283-289. DOI: 10.3969/j.issn.0253-2778.2017.04.001


ZHU Shaolin, YANG Yating, MI Chenggang, LI Xiao, WANG Lei

1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China;
2. Key Laboratory of Speech Language Information Processing of Xinjiang, Urumqi 830011, China;
3. University of Chinese Academy of Sciences, Beijing 100049, China

Received: 2016-03-01 Revised: 2016-09-17 Online: 2017-04-30 Published: 2017-04-30
Corresponding author: YANG Yating
About the author: ZHU Shaolin, male, born in 1989, PhD candidate. Research interests: machine learning, information processing. E-mail: zhushaolin003@163.com
Funding: Supported by the National Natural Science Foundation of China (61473001, 71071045, 71131002).

Abstract

In corpus selection, redundancy in the data arises at both the vocabulary level and the sentence level. Existing methods mainly select data by vocabulary-level coverage. They effectively reduce the redundancy of words and phrases, but they do not take sentence-level coverage into account. To select a smaller training corpus from a large-scale bilingual corpus while obtaining a translation system as good as, or better than, one trained on the full data, parallel data are selected on the basis of bilingual sentence-pair coverage, and a selection method that combines unseen n-grams with edit distance is proposed. Experimental results show that the method uses less training data yet yields a translation system whose quality is on par with that of the system trained on the original corpus.

Key words: statistical machine translation, bilingual sentence pairs, corpus selection
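
The abstract names the two ingredients of the method (unseen n-grams and edit distance) but not how they are combined. The following Python sketch illustrates one plausible greedy combination and should not be read as the authors' algorithm. The function names, the whitespace tokenization, and the parameters `max_n` and `min_dist` are all assumptions made for illustration; real Uyghur or Chinese text would first need proper tokenization or word segmentation.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]          # (source sentence, target sentence)
NGram = Tuple[str, ...]


def ngrams(tokens: List[str], max_n: int) -> Set[NGram]:
    """All n-grams of order 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def edit_distance(a: List[str], b: List[str]) -> int:
    """Token-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def norm_dist(a: List[str], b: List[str]) -> float:
    """Edit distance scaled to [0, 1] by the longer sentence length."""
    return edit_distance(a, b) / max(len(a), len(b), 1)


def select_pairs(pairs: List[Pair], max_n: int = 2,
                 min_dist: float = 0.5) -> List[Pair]:
    """Greedy selection (illustrative assumption, not the paper's procedure).

    A pair is kept only if it (1) contributes at least one n-gram not yet
    covered on either side, and (2) is not a near-duplicate, on both sides,
    of a pair that was already selected.
    """
    seen_src: Set[NGram] = set()
    seen_tgt: Set[NGram] = set()
    kept_tokens: List[Tuple[List[str], List[str]]] = []
    selected: List[Pair] = []
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        s_grams, t_grams = ngrams(s, max_n), ngrams(t, max_n)
        # Vocabulary-level coverage: skip pairs that add no unseen n-grams.
        if not (s_grams - seen_src) and not (t_grams - seen_tgt):
            continue
        # Sentence-level coverage: skip near-duplicates of selected pairs.
        if any(norm_dist(s, ps) < min_dist and norm_dist(t, pt) < min_dist
               for ps, pt in kept_tokens):
            continue
        seen_src |= s_grams
        seen_tgt |= t_grams
        kept_tokens.append((s, t))
        selected.append((src, tgt))
    return selected


if __name__ == "__main__":
    toy = [("a b c", "x y z"), ("a b c", "x y z"),
           ("a b d", "x y w"), ("e f g", "p q r")]
    for pair in select_pairs(toy):
        print(pair)
```

In this sketch the n-gram test handles word- and phrase-level redundancy, while the edit-distance test removes sentences that differ from an already selected sentence in only a few tokens. The pairwise comparison is quadratic in the number of selected pairs, so a large corpus would need an indexing or blocking step that is not shown here.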

CLC number: TP391

ZHU Shaolin, YANG Yating, MI Chenggang, LI Xiao, WANG Lei. Corpus selection for Uyghur-Chinese machine translation based on bilingual sentence coverage[J]. Journal of University of Science and Technology of China, 2017, 47(4): 283-289.

