Cloud-based infrastructures enable applications to collect and analyze massive amounts of data. Whether these applications are newly developed or evolved from existing RDBMS-based implementations, NoSQL databases offer an attractive platform for addressing this challenge. However, developers find it difficult to manage data effectively in NoSQL databases, because these platforms offer little support for data organization. Since poor data organization may misuse the features of the NoSQL database and result in unsatisfactory performance, developing a systematic method for NoSQL data-schema design is a timely and important problem. In this paper, we focus on geospatial applications, a family of big-data systems with distinct data types and usage patterns that need scalability. We propose the HGrid data model for HBase, based on a hybrid index structure that combines a quad-tree and a regular grid as primary and secondary indices, respectively. We have comparatively evaluated the performance of HGrid, with uniform and skewed data, against two other data models based on quad-tree and regular-grid indices. Our results demonstrate that HGrid scales well and supports efficient range and k-nearest-neighbor queries. Although this model does not outperform all its competitors in terms of query response time, it is more flexible for discontinuous and skewed space, and its index requires less space than the corresponding quad-tree and regular-grid indices, which makes its deployment possible with fewer resources. Through this study, we also formulate a set of guidelines on how to organize data for geospatial applications in HBase.
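To make the hybrid index idea concrete, the following minimal sketch shows one way a coarse quad-tree code and a fine regular-grid cell could be combined into a single key. It is an illustration under stated assumptions, not the HGrid schema itself: the function names (`quadtree_code`, `grid_cell`, `hybrid_key`), the unit-square bounds, the depth, the grid resolution, and the key layout are all hypothetical, and no HBase API is used.

```python
# Illustrative sketch only: names, bounds, depth, grid size, and key layout
# are assumptions, not the schema defined in the paper.

def quadtree_code(x, y, depth, bounds=(0.0, 0.0, 1.0, 1.0)):
    """Return the quad-tree path of (x, y) as digits 0-3, plus the final tile."""
    min_x, min_y, max_x, max_y = bounds
    digits = []
    for _ in range(depth):
        mid_x = (min_x + max_x) / 2.0
        mid_y = (min_y + max_y) / 2.0
        quadrant = 0
        if x >= mid_x:          # east half sets bit 0
            quadrant |= 1
            min_x = mid_x
        else:
            max_x = mid_x
        if y >= mid_y:          # north half sets bit 1
            quadrant |= 2
            min_y = mid_y
        else:
            max_y = mid_y
        digits.append(str(quadrant))
    return "".join(digits), (min_x, min_y, max_x, max_y)


def grid_cell(x, y, tile, cells_per_side):
    """Locate (x, y) in a regular grid laid over one quad-tree tile."""
    min_x, min_y, max_x, max_y = tile
    col = int((x - min_x) / (max_x - min_x) * cells_per_side)
    row = int((y - min_y) / (max_y - min_y) * cells_per_side)
    # Clamp points that sit exactly on the tile's upper edge.
    return min(row, cells_per_side - 1), min(col, cells_per_side - 1)


def hybrid_key(x, y, depth=2, cells_per_side=4):
    """Primary part: quad-tree code. Secondary part: grid row and column."""
    code, tile = quadtree_code(x, y, depth)
    row, col = grid_cell(x, y, tile, cells_per_side)
    return f"{code}_{row:02d}_{col:02d}"


if __name__ == "__main__":
    print(hybrid_key(0.1, 0.1))
    print(hybrid_key(0.9, 0.1))
```

In this sketch, points that fall in the same quad-tree tile share a key prefix, so a range query could first narrow the search by prefix and then filter by grid cell. How the two parts are actually mapped onto HBase row keys and columns is a design decision this sketch does not settle.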
Fan Gao, Peng Yue, Zhaoyan Wu, Mingda Zhang
Daniel A. Keim, Christian Panse, Mike Sips, Stephen C. North
Arménio Antunes, Maribel Yasmina Santos, Adriano Moreira