Journal of University of Science and Technology of China ›› 2016, Vol. 46 ›› Issue (1): 56-65. DOI: 10.3969/j.issn.0253-2778.2016.01.008

• Articles •

A data model for Tabular repositories and its query problems

黄冬梅,孙乐,石少华,苏诚,赵丹枫   

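  • Received: 2015-08-27 Revised: 2015-09-29 Accepted: 2015-09-29 Online: 2015-09-29 Published: 2015-09-29
  • Corresponding author: ZHAO Danfeng
  • About the author: HUANG Dongmei, female, born in 1964, professor. Research interest: data mining. E-mail: dmhuang@shou.edu.cn
  • Supported by:
    the National Natural Science Foundation of China (61272098).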

Tabular-oriented data model and its query issues

HUANG Dongmei, SUN Le, SHI Shaohua, SU Cheng, ZHAO Danfeng   

  1. School of Information, Shanghai Ocean University, Shanghai 201306, China; 2. East Sea Forecast Center of Oceanic Administration of China, Shanghai 200136, China
  • Received:2015-08-27 Revised:2015-09-29 Accepted:2015-09-29 Online:2015-09-29 Published:2015-09-29

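Abstract: The development of informatization has made data storage and representation distributed and heterogeneous. The data include not only traditional structured data such as relational and object-oriented databases, but also special unstructured data without an explicit structure, such as Excel and CSV files. At the same time, the data show Big Data characteristics such as large volume, fast updates, and weak usability. However, organizing and managing form data such as Excel with unstructured and semi-structured documents leaves the data weakly controllable and weakly usable, with poor access efficiency. To address these problems, this paper takes Excel documents as the data source, proposes a new relational data model for Tabular repositories, and discusses querying and query optimization on it. First, a formal definition of Tabular form data is given. Second, a PartiPath partition tree is designed to perform the relational partitioning and structural transformation of tables, and the data model and data schema are given on the basis of the relational model. Next, the basic query problems on form data are defined, and the query similarity measure is improved by incorporating a user interest index. Finally, an experimental analysis and a summary are given.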

Keywords: Tabular repository, query, data model, PartiPath partition tree, relational model

Abstract: With the rapid development of information technologies, data storage and representation have become distributed and heterogeneous. The data include not only traditional structured sources such as relational and object-oriented databases, but also special unstructured sources such as Excel and CSV documents. These data are also high-volume, continuously updated, and of low usability, all of which are characteristics of Big Data. However, organizing and managing Excel and other form data with unstructured or semi-structured methods yields data that are weakly controllable and weakly usable, with poor access efficiency. To address this problem, this paper takes Excel documents as the data source, proposes a new Tabular-oriented relational data model, and discusses querying and query optimization over it. First, a formal definition of Tabular form data is given. Second, a PartiPath tree is designed to perform the relational division and structural transformation of tables, and the data model and its relation schema are presented on the basis of the relational model. Then, four basic queries are defined, and query similarity is optimized with an improved DICE measure that incorporates user interest. Finally, experimental results are analyzed and conclusions are drawn.

Key words: Tabular repository, query, data model, PartiPath tree, relational model
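
For readers unfamiliar with the similarity measure named in the abstract, the following is a minimal sketch of the standard Sørensen-Dice coefficient over term sets, together with one possible way of folding in user interest. The abstract does not give the paper's improved formula, so `weighted_dice`, the `interest` weight table, and the example terms are illustrative assumptions only, not the authors' method.

```python
def dice(a, b):
    """Standard Sorensen-Dice coefficient between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty inputs
    return 2 * len(a & b) / (len(a) + len(b))


def weighted_dice(a, b, interest, default=1.0):
    """Hypothetical interest-weighted variant (an assumption, not the paper's formula).

    Each term contributes its weight from `interest` instead of a count of 1;
    terms missing from `interest` get `default`.
    """
    a, b = set(a), set(b)

    def weight(terms):
        return sum(interest.get(t, default) for t in terms)

    denom = weight(a) + weight(b)
    if denom == 0:
        return 0.0
    return 2 * weight(a & b) / denom


# Illustrative data: terms of a user query and the header terms of one table.
query = ["salinity", "depth", "station"]
header = ["station", "depth", "temperature"]

plain = dice(query, header)                             # 2*2 / (3+3)
boosted = weighted_dice(query, header, {"depth": 3.0})  # 2*4 / (5+5)
```

With all weights equal to the default, `weighted_dice` reduces to the plain coefficient; raising the weight of a shared term such as `depth` raises the score for tables that contain it.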

CLC number: