Research on web page automatic categorization based on structural and text information

Journal of University of Science and Technology of China, 2017, Vol. 47, Issue (4): 290-296. DOI: 10.3969/j.issn.0253-2778.2017.04.002

GU Min, GUO Qing, CAO Ye, ZHU Feng, GU Yanhui, ZHOU Junsheng, QU Weiguang

1. School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, China;
2. Fujian Province Key Laboratory of Information Processing and Intelligence Control, Minjiang University, Fuzhou 350121, China

Received: 2016-03-01 Revised: 2016-09-17 Online: 2017-04-30 Published: 2017-04-30
Corresponding author: GU Yanhui
About the author: GU Min, female, born in 1993, master's student. Research interest: natural language processing. E-mail: 15205150477@163.com
Funding: Supported by the National Natural Science Foundation of China (61472191), the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (15KJA420001), the Scientific Research Foundation for Returned Overseas Chinese Scholars (教外司留[2015]1098号), the Open Fund of the Fujian Province Key Laboratory of Information Processing and Intelligence Control (Minjiang University) (MJUKF201705), the Open Project of the Shandong Key Laboratory of Language Resource Development and Application (211180A41601), and the Postgraduate Research and Innovation Program of Jiangsu Higher Education Institutions (KYLX16_1293).

Abstract: Web pages contain abundant information resources, and categorizing them makes their content easier to extract and manage and more convenient for users to read. To handle the complex structural information and rich text content of web pages, a web page categorization method based on page text and structure is proposed. Drawing on the structural characteristics and text information of web pages related to mass innovation ("众创"), the method classifies pages by combining joint features with atomic features. Experiments show that the method is feasible to some extent and achieves higher precision and recall than classification using text information alone.

Key words: web page classification, naive Bayes, atomic feature, joint feature
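
As a rough illustration of how atomic and joint features might be combined in a naive Bayes classifier (the classifier named in the key words), the following is a minimal sketch and not the authors' implementation. It assumes that an atomic feature is a single token and that a joint feature pairs a token with the HTML tag it appears in. These definitions, the feature-string format, the class labels, and the toy pages are all illustrative assumptions, since the abstract does not define them.

```python
import math
from collections import Counter, defaultdict


def extract_features(page):
    """Turn a page, given as (tag, token) pairs, into feature strings.

    Assumed definitions (not taken from the paper):
      atomic feature: the token alone
      joint feature:  the token together with its enclosing HTML tag
    """
    feats = []
    for tag, token in page:
        feats.append(f"w={token}")           # atomic feature
        feats.append(f"t={tag}|w={token}")   # joint feature
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # pages seen per class
        self.feat_counts = defaultdict(Counter)  # feature counts per class
        self.vocab = set()                       # all features seen in training

    def fit(self, pages, labels):
        for page, label in zip(pages, labels):
            self.class_counts[label] += 1
            for feat in extract_features(page):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, page):
        total_pages = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_pages in self.class_counts.items():
            # log prior of the class
            score = math.log(n_pages / total_pages)
            denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
            for feat in extract_features(page):
                if feat not in self.vocab:
                    continue  # ignore features never seen in training
                score += math.log((self.feat_counts[label][feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: each page is a list of (HTML tag, token) pairs.
train_pages = [
    [("title", "startup"), ("p", "funding"), ("p", "incubator")],
    [("title", "football"), ("p", "match"), ("p", "score")],
    [("title", "incubator"), ("p", "startup"), ("a", "funding")],
    [("title", "match"), ("p", "football"), ("a", "score")],
]
train_labels = ["innovation", "sports", "innovation", "sports"]

model = NaiveBayes().fit(train_pages, train_labels)
prediction = model.predict([("title", "startup"), ("p", "incubator")])
```

Under these assumptions, a text-only baseline would correspond to keeping just the `w=` features, which gives a simple way to compare the two settings on the same data.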

CLC number: TP391

GU Min, GUO Qing, CAO Ye, ZHU Feng, GU Yanhui, ZHOU Junsheng, QU Weiguang. Research on web page automatic categorization based on structural and text information[J]. Journal of University of Science and Technology of China, 2017, 47(4): 290-296.

