Mechanism analysis of the accelerator for k-nearest neighbor algorithm based on data partition

doi:10.3969/j.issn.0253-2778.2018.04.009

Abstract

Abstract: Due to its absence of hypotheses for the underlying distributions of data, simple execution and strong generation ability, k-nearest neighbor classification algorithm (kNN) is widely used in face recognition, text classification, emotional analysis and other fields. kNN does not need the training process, but it only stores the training instances until the unlabeled instance appears, and executes the predicted process. However, kNN needs to compute the similarity between the unlabeled instance and all the training instances, hence it is difficult to deal with large-scale data. To overcome this difficulty, #br##br# the process of computing the nearest neighbors is converted to a constrained optimization problem, and an estimation is given of difference on the value of the objective function under the optimal solution with and without data partition. The theoretical analysis of this estimation indicates that data partition using clustering can reduce this difference, and the k-nearest neighbor algorithm based on clustering can have a strong generation ability. Experiment results on public datasets show that the k-nearest neighbor algorithm based on clustering can largely obtain the same nearest neighbors of raw kNN, thus obtaining higher classification accuracy.

Key words: k-nearest neighbor, data partition, local information, instance subset, clustering

CLC Number:

TP391

SONG Yunsheng,WANG Jie, LIANG Jiye,. Mechanism analysis of the accelerator for k-nearest neighbor algorithm based on data partition[J]. Journal of University of Science and Technology of China, 2018, 48(4): 331-340.

References

［1］
COVER T, HART P. Nearest neighbor pattern classification[J]. IEEE Transactions on Information Theory, 2002, 13(1): 21-27.
[2] XU B H, FU Y W, JIANG Y G, et al. Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization[J]. IEEE Transactions on Affective Computing, 2018, 9(2): 255-270.
[3] 周志华. 机器学习[M]. 北京: 清华大学出版社, 2016.
[4] 张敏灵. 一种新型多标记懒惰学习算法[J]. 计算机研究与发展, 2012, 49(11): 2271-2282.
[5] FRIEDMAN J, HASTIE T, TIBSHIRANI R. The Elements of Statistical Learning[M]. Berlin: Springer, 2001.
[6] WU X, KUMAR V, QUINLAN J R, et al. Top 10 algorithms in data mining[J]. Knowledge and Information Systems, 2008, 14(1): 1-37.
[7] KONONENKO I, KUKAR M. Machine Learning and Data Mining: Introduction to Principles and Algorithms[M]. Chichester: Harwood Publishing Limited, 2007.
[8] LI Y, MAGUIRE L. Selecting critical patterns based on local geometrical and statistical information[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(6): 1189-1201.
[9] MUJA M, LOWE D G. Scalable nearest neighbor algorithms for high dimensional data[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(11): 2227-2240.
[10] MARCHIORI E. Class conditional nearest neighbor for large margin instance selection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(2): 364-370.
[11] ANGIULLI F. Fast nearest neighbor condensation for large data sets classification[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(11): 1450-1464.
[12] LIU T, MOORE A W, YANG K, et al. An investigation of practical approximate nearest neighbor algorithms[C]// Proc of Conference on Neural Information Processing Systems. Vancouver, Canada: MIT Press, 2005: 825-832.
[13] MCFEE B, LANCKRIET G R G. Large-scale music similarity search with spatial trees[C]// Proceedings of the 12th International Society for Music Information Retrieval Conference. Florida: ISMIR Press, 2014: 566-574.
[14] GARCIA S, DERRAC J, CANO J, et al. Prototype selection for nearest neighbor classification: Taxonomy and empirical study[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(3): 417-435.
[15] HSIEH C J, SI S, DHILLON I S. A divide-and-conquer solver for kernel support vector machines[C]// Proceedings of the 27th International Conference on Machine Learning. Haifa: IMLS Press, 2014: 566-574.
[16] FRIEDMAN J H, BENTLEY J L, FINKEL R A. An algorithm for finding best matches in logarithmic expected time[J]. ACM Transactions on Mathematical Software, 1977, 3(3): 209-226.
[17] VERMA N, KPOTUFE S, DASGUPTA S. Which spatial partition trees are adaptive to intrinsic dimension?[C]// Proceedings of the 25h Conference on Uncertainty in Artificial Intelligence. Montreal, Canada: AUAI Press, 2009: 565-574.
[18] SLANEY M, CASEY M. Locality-sensitive hashing for finding nearest neighbors [J]. IEEE Signal Processing Magazine, 2008, 25(2): 128-131.
[19] OLVERA-LPEZ J A, CARRASCO-OCHOA J A, MARTNEZ-TRINIDAD J F, et al. A review of instance selection methods[J]. Artificial Intelligence Review, 2010, 34(2): 133-143.
[20] BRIGHTON H, MELLISH C. Advances in instance selection for instance-based learning algorithms[J]. Data Mining and Knowledge Discovery, 2002, 6(2): 153-172.
[21] HART P. The condensed nearest neighbor rule[J]. IEEE Transactions on Information Theory, 1968, 14(3): 515-516.

[22] ANGIULLI F, FOLINO G. Distributed nearest neighbor-based condensation of very large data sets[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(12): 1593-1606.
[23] NIKOLAIDIS K, GOULERMAS J Y, WU Q H. A class boundary preserving algorithm for data condensation[J]. Pattern Recognition, 2011, 44(3): 704-715.
[24] ZHANG H, SUN G. Optimal reference subset selection for nearest neighbor classification by tabu search[J]. Pattern Recognition, 2002, 35(7): 1481-1490.
[25] 王熙照, 王亚东, 湛燕, 等. 学习特征权值对 K-均值聚类算法的优化[J]. 计算机研究与发展, 2003, 40(6): 869-873.
[26] 杨润玲, 高新波. 基于加权模糊c均值聚类的快速图像自动分割算法[J]. 中国图象图形学报, 2007, 12(12): 2105-2112.
[27] DEMAR J. Statistical comparisons of classifiers over multiple data sets[J]. Journal of Machine Learning Research, 2006, 7(1): 1-30.

[28] CHANG C C, LIN C J. LIBSVM: A library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): No.27.
[29] BLAKE C L, MERZ C J. UCI Repository of machine learning databases[EB/OL]. [2017-06-15] http://www. ics. uci. edu/~ mlearn/MLRepository. Html, 1998.
[30] WILCOXON F. Individual comparisons by ranking methods[J]. Biometrics Bulletin, 1945, 1(6): 80-83.

[31] GARCA-OSORIO C, HARO-GARCA A, GARCA-PEDRAJAS N. Democratic instance selection: A linear complexity instance selection algorithm based on classifier ensemble concepts[J]. Artificial Intelligence, 2010, 174(5): 410-441.
[32] KORDOS M, BLACHNIK M, STRZEMPA D. Do we need whatever more than k-NN?[C]// Proceedings of the 10th International Conference on Artificial Intelligence and Soft Computing. Zakopane, Poland: Springer, 2010: 414-421.

()
(

[1]	WANG Yue, LI Jing. Research on optimization method of convolutional neural network based on visualization [J]. Journal of University of Science and Technology of China, 2020, 50(7): 959-967.
[2]	ZHANG Yikai, PENG Yong, KONG Wanzeng, WEN Yimin. Fuzzy local coordinate concept factorization with graph regularization [J]. Journal of University of Science and Technology of China, 2020, 50(7): 993-1002.
[3]	GONG Lejun , ZHOU Shehai, CHENG Yifei, GAO Zhihong, LI Huakang. Identification of PBMC-related cells of single cell RNA sequence data [J]. Journal of University of Science and Technology of China, 2020, 50(7): 1013-1018.
[4]	ZHANG Yuzhou, ZHANG Ziwei. Solving multi-station refuse collection problem based on cooperative co-evolutionary algorithm [J]. Journal of University of Science and Technology of China, 2020, 50(5): 695-704.
[5]	ZHANG Binkai, WANG Xiang, ZHENG Jinjin. Medical image segmentation algorithm based on dictionary learning and sparse clustering [J]. Journal of University of Science and Technology of China, 2019, 49(10): 791-796.
[6]	TANG Fengqin,2， DING Wenwen. Community detection based on spectral clustering with node attributes [J]. Journal of University of Science and Technology of China, 2018, 48(2): 162-173.
[7]	WANG Jiayu, ZHANG Zhenyu, CHU Zheng, WU Xiaohong. A trajectory data density partition based distributed parallel clustering method [J]. Journal of University of Science and Technology of China, 2018, 48(1): 47-56.
[8]	RAO Qi, YANG Yan*, TENG Fei. Condition recognition of high-speed train based on multi-view weighted clustering ensemble [J]. Journal of University of Science and Technology of China, 2018, 48(1): 35-41.
[9]	XU Xueli, ZHAO Xuejing. Application of sparse spectral clustering algorithm in high-dimensional data [J]. Journal of University of Science and Technology of China, 2017, 47(4): 311-319.
[10]	ZHANG Zhenya, ZHANG Meng, XIE Chenlei,ZHANG Zhaoxiang, FANG Qiansheng. Research on leakage location of secondary water distribution networks [J]. Journal of University of Science and Technology of China, 2017, 47(4): 336-341.
[11]	ZHANG Hui, ZHAO Jing. An unsupervised boundary detection algorithm based on orientation contrast model [J]. Journal of University of Science and Technology of China, 2017, 47(1): 26-31.
[12]	LUO Weijia, QIAO Shaojie, HAN Nan, YUAN Changan, BI Yingzhou, SHU Hongping. A k-medoids based clustering algorithm in location based social networks [J]. Journal of University of Science and Technology of China, 2017, 47(1): 70-79.