Similarity search and similarity join are two important operations in text databases. The filter-and-verify framework aims to reduce comparison time by filtering out some pairs of texts before the remaining pairs are actually compared. Many filter methods do not take into account the repetition of query words over time. A query that is repeated frequently over a time period is called a high-frequency query. The high-frequency-queries-based filter is a filter method that deals with this type of query. The performance of this method depends on the choice of high-frequency queries. This paper proposes two methods for finding the set of high-frequency queries in a given query set. The first uses DBSCAN, and the second, called DBSM, uses DBSCAN with a merging strategy. The experimental results show that both DBSCAN and DBSM can find high-frequency queries, but the set of high-frequency queries obtained from DBSM gives the high-frequency-queries-based filter higher pruning power.
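As an illustration of how density-based clustering can pick out repeated queries, the following is a minimal sketch of plain DBSCAN over a small query set. It is not the authors' implementation. The Jaccard distance over word sets, the values of `eps` and `min_pts`, the sample queries, and the rule of taking the most frequent string in each cluster as its high-frequency query are all assumptions made for this example. The merging strategy of DBSM is not described here, so it is not sketched.

```python
from collections import Counter

NOISE = -1  # label for queries that belong to no dense group


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| over the word sets of two queries (assumed measure)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def region_query(queries, i, eps):
    """Indices of all queries within distance eps of query i (including i itself)."""
    return [j for j in range(len(queries))
            if jaccard_distance(queries[i], queries[j]) <= eps]


def dbscan(queries, eps, min_pts):
    """Plain DBSCAN; returns one cluster label per query, NOISE for outliers."""
    labels = [None] * len(queries)
    cluster = 0
    for i in range(len(queries)):
        if labels[i] is not None:
            continue
        neighbors = region_query(queries, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(queries, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point, so expand
        cluster += 1
    return labels


def cluster_representatives(queries, labels):
    """Assumed rule: the most frequent query string in each cluster."""
    groups = {}
    for query, label in zip(queries, labels):
        if label != NOISE:
            groups.setdefault(label, Counter())[query] += 1
    return {label: counts.most_common(1)[0][0] for label, counts in groups.items()}


# Hypothetical query log, in arrival order.
queries = [
    "cheap flights bangkok",
    "cheap flights bangkok",
    "cheap flights to bangkok",
    "python dbscan tutorial",
    "cheap flights bangkok",
    "weather tomorrow",
]
labels = dbscan(queries, eps=0.3, min_pts=3)
representatives = cluster_representatives(queries, labels)
```

With these assumed parameters, the four flight queries form one cluster and the two one-off queries are labeled as noise. The neighborhood search here is quadratic in the number of queries, which is acceptable for a sketch but not for a large query log.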
Jaruloj Chongstitvatana, Natthee Thitinanrungkit
Kamolwan Kunanusont, Jaruloj Chongstitvatana
Man Yu, Shupeng Han, Yale Chai, Ying Zhang, Yanlong Wen