Journal of University of Science and Technology of China ›› 2018, Vol. 48 ›› Issue (4): 331-340. DOI: 10.3969/j.issn.0253-2778.2018.04.009

• Articles •

Mechanism analysis of the accelerator for the k-nearest neighbor classification algorithm based on data partition

SONG Yunsheng, WANG Jie, LIANG Jiye

  1. School of Computer & Information Technology, Shanxi University, Taiyuan 030006, China;
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan 030006, China
  • Received: 2017-09-19  Revised: 2018-04-11  Online: 2018-04-30  Published: 2018-04-30
  • Corresponding author: LIANG Jiye
  • About the author: SONG Yunsheng (male, born 1984) is a PhD candidate whose research interest is machine learning. E-mail: sys_sd@126.com
  • Supported by: Key Program of the National Natural Science Foundation of China (U1435212, 61432011) and the Key Science and Technology Program of Shanxi Province (MQ2014-09).


Abstract: Because it makes no assumptions about the underlying data distribution, is simple to execute, and has strong generalization ability, the k-nearest neighbor (kNN) classification algorithm is widely used in face recognition, text classification, sentiment analysis, and other fields. kNN needs no training process: it simply stores the training instances and, when an unlabeled instance arrives, predicts its class by comparing its similarity with the stored instances. However, kNN must compute the similarity between the unlabeled instance and all training instances, so it is difficult to process large-scale data efficiently. To overcome this difficulty, the process of finding the nearest neighbors is converted into a constrained optimization problem, and an estimate is given of the difference in the objective function value at the optimal solutions with and without data partition. Theoretical analysis of this estimate indicates that partitioning the data by clustering can effectively reduce this difference, which in turn guarantees that the clustering-based k-nearest neighbor algorithm (DC-kNN) retains strong generalization ability. Experimental results on public datasets show that DC-kNN largely provides a test instance with the same k nearest neighbors as the raw kNN algorithm and thus achieves high classification accuracy.
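The acceleration idea described above — partition the training set by clustering, then search for neighbors only in the cluster whose centroid is closest to the query — can be sketched as follows. This is a minimal illustration of the general scheme, not the authors' exact DC-kNN procedure: the plain k-means routine, the cluster count, and the fallback for empty clusters are all assumptions made for the sketch.

```python
import random
from collections import Counter

def dist(a, b):
    # Euclidean distance between two feature vectors.
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def kmeans(xs, n_clusters, iters=20, seed=0):
    # Plain k-means: returns the centroids and each instance's cluster index.
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(xs, n_clusters)]
    assign = [0] * len(xs)
    for _ in range(iters):
        for i, x in enumerate(xs):
            assign[i] = min(range(n_clusters), key=lambda c: dist(x, centroids[c]))
        for c in range(n_clusters):
            members = [xs[i] for i in range(len(xs)) if assign[i] == c]
            if members:  # keep the old centroid if the cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def dc_knn_predict(xs, ys, centroids, assign, query, k=3):
    # Search for the k nearest neighbors only inside the cluster whose
    # centroid is closest to the query, then take a majority vote.
    c = min(range(len(centroids)), key=lambda j: dist(query, centroids[j]))
    candidates = [i for i in range(len(xs)) if assign[i] == c]
    if not candidates:  # degenerate case: fall back to the full training set
        candidates = list(range(len(xs)))
    candidates.sort(key=lambda i: dist(query, xs[i]))
    return Counter(ys[i] for i in candidates[:k]).most_common(1)[0][0]

# Toy data: two well-separated classes.
xs = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
ys = ["a", "a", "a", "b", "b", "b"]
centroids, assign = kmeans(xs, n_clusters=2)
print(dc_knn_predict(xs, ys, centroids, assign, (5.0, 5.1), k=3))  # prints "b"
```

Each query is compared against one cluster rather than the whole training set, which is the source of the speedup; the theoretical question analyzed in the paper is how much this restriction can change the set of neighbors found.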

Key words: k-nearest neighbor, data partition, local information, instance subset, clustering
