Semantic similarity measurement based on low-dimensional sense vector model

doi:10.3969/j.issn.0253-2778.2016.09.002

Abstract

Semantic similarity measurement improves the accuracy and efficiency of information retrieval, and has therefore become a core component of text processing. To address lexical ambiguity such as polysemy, a sense vector model based on vector composition is proposed, which integrates a knowledge base with a corpus by fusing multiple semantic features derived from both. The model builds on continuous distributed word vectors and the inherent semantic properties of WordNet. First, continuous word vectors are trained in advance from a textual corpus using a deep-learning neural network language model. Then, multiple kinds of semantic and relational information are extracted from WordNet to augment the original vectors and generate sense vectors for words. The semantic similarity between concepts can then be measured by the similarity of their sense vectors. Experimental results on a benchmark indicate that this measure outperforms state-of-the-art measures based on either WordNet or corpora. Compared with measures based on the original distributed word vectors, the proposed measure improves the Pearson correlation coefficient by 7.5%. The results also demonstrate the contribution of multiple-feature fusion to measuring conceptual semantic similarity.
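The pipeline summarized above (pre-trained word vectors, augmentation with knowledge-base features, similarity between the resulting sense vectors, evaluation by Pearson correlation) can be illustrated with a minimal sketch. The abstract does not give the composition formula, so the weighted additive composition below, the weight `alpha`, the function names, and the three-dimensional toy vectors are all assumptions made for illustration. They are not the authors' model or data, and no real WordNet lookup is performed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def compose_sense_vector(word_vec, feature_vecs, alpha=0.5):
    """Hypothetical composition: alpha * word vector + (1 - alpha) * mean of
    the vectors of sense-specific feature words (e.g. words that a knowledge
    base associates with one sense). This is an assumed scheme."""
    if not feature_vecs:
        return list(word_vec)
    dim = len(word_vec)
    mean = [sum(f[i] for f in feature_vecs) / len(feature_vecs) for i in range(dim)]
    return [alpha * w + (1.0 - alpha) * m for w, m in zip(word_vec, mean)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy word vectors (invented for illustration, not trained on any corpus).
toy_vectors = {
    "bank": [0.5, 0.5, 0.0],
    "money": [1.0, 0.0, 0.0],
    "deposit": [0.9, 0.1, 0.0],
    "river": [0.0, 1.0, 0.0],
    "shore": [0.1, 0.9, 0.0],
}

# Two sense vectors for the ambiguous word "bank", each built from the word
# vector plus hand-picked feature words standing in for knowledge-base features.
bank_finance = compose_sense_vector(
    toy_vectors["bank"], [toy_vectors["money"], toy_vectors["deposit"]]
)
bank_river = compose_sense_vector(
    toy_vectors["bank"], [toy_vectors["river"], toy_vectors["shore"]]
)

sim_finance = cosine(bank_finance, toy_vectors["money"])
sim_river = cosine(bank_river, toy_vectors["money"])
```

In this toy setting the financial sense of "bank" lies closer to "money" than the river sense does, which is the kind of disambiguation a single word vector cannot express. The `pearson` helper shows how predicted similarity scores could be compared with human ratings on a benchmark.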

Key words: sense vector, feature fusion, distributed word embedding, semantic similarity

CLC Number: TP391

CAI Yuanyuan, LU Wei. Semantic similarity measurement based on low-dimensional sense vector model[J]. Journal of University of Science and Technology of China, 2016, 46(9): 719-726.


()
()
