Journal of University of Science and Technology of China ›› 2017, Vol. 47 ›› Issue (4): 290-296. DOI: 10.3969/j.issn.0253-2778.2017.04.002

• Original Paper •

Research on web page automatic categorization based on structural and text information

GU Min, GUO Qing, CAO Ye, ZHU Feng, GU Yanhui, ZHOU Junsheng, QU Weiguang

    1. School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, China;
    2. Fujian Province Key Laboratory of Information Processing and Intelligence Control, Minjiang University, Fuzhou 350121, China
  • Received: 2016-03-01 Revised: 2016-09-17 Online: 2017-04-30 Published: 2017-04-30

Abstract: Web pages contain abundant information resources, and categorizing them makes that information easier to extract and manage. Because web pages combine complex structure with rich text, a categorization method based on both structural and text information is proposed. The method combines joint features and atomic features to classify web pages. Experimental results show that the proposed method is feasible to some extent and achieves higher precision and recall than using text information alone.

Key words: web page classification, naive Bayes, atomic feature, joint feature
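
To make the idea in the abstract concrete, the following is a minimal sketch of a naive Bayes classifier that mixes atomic and joint features. It is not the authors' implementation: the abstract does not define the features, so the sketch assumes that an atomic feature is a single HTML tag or a single word, and that a joint feature is a (tag, word) pair. The page model (a list of (tag, text) pairs), the function `extract_features`, and the class `NaiveBayes` are all hypothetical names introduced for illustration.

```python
import math
from collections import Counter, defaultdict


def extract_features(page):
    """Turn a page into a list of features.

    `page` is a list of (tag, text) pairs, a hypothetical simplified
    page model. The feature definitions below are assumptions, not
    taken from the paper.
    """
    feats = []
    for tag, text in page:
        feats.append(("tag", tag))                 # atomic structural feature
        for word in text.lower().split():
            feats.append(("word", word))           # atomic text feature
            feats.append(("tag+word", tag, word))  # joint structure/text feature
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, pages, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for page, label in zip(pages, labels):
            for feat in extract_features(page):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, page):
        total = sum(self.class_counts.values())
        feats = [f for f in extract_features(page) if f in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / total)
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for feat in feats:
                score += math.log((self.feat_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    pages = [
        [("title", "football match"), ("p", "team wins the league")],
        [("title", "stock market"), ("p", "shares fall on earnings")],
    ]
    model = NaiveBayes().fit(pages, ["sports", "finance"])
    print(model.predict([("title", "football league")]))
```

In this sketch the joint feature lets the same word carry different weight depending on where it appears (for example, in a title versus in body text), which is one plausible reading of why combining structure with text could outperform text alone.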
