DISSERTATION

FINDING SETS OF HIGH-FREQUENCY QUERIES FOR HIGH-FREQUENCY-QUERY-BASED FILTER FOR SIMILARITY JOIN

Abstract

Similarity search and similarity join are important operations in text databases. Similarity search finds all records that are similar to a given text query, while similarity join matches pairs of similar records from two relations. In some situations, similar queries are repeated over a period of time. These queries are called high-frequency queries. The high-frequency-query-based filter is designed to speed up this type of query. It uses an index structure called a similarity table to prune unrelated text records from the relations. Each similarity table is built from a high-frequency query chosen from the query set, so the performance of the filter depends heavily on which queries are chosen. This thesis proposes a method for finding high-frequency queries for the high-frequency-query-based filter. The proposed method uses a density-based cluster analysis, DBSCAN, to capture the main characteristics of the query set by grouping the queries and finding representative points in each group. Two methods, DBRAN and DBSM, are proposed to deal with redundant high-frequency queries. DBRAN finds clusters of high-frequency queries with DBSCAN and randomly chooses one high-frequency query from each cluster as its representative. DBSM also uses DBSCAN to find clusters, then repeatedly merges the queries in these clusters until merging no longer improves the similarity tables. For evaluation, the proposed method is applied to various sets of queries to find high-frequency queries for three datasets. DBSM performs better than DBRAN when the similarity between high-frequency queries is low. When the similarity between high-frequency queries is high, the two methods perform about the same.
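The DBRAN step described above (cluster the queries with DBSCAN, then pick one random query per cluster) can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not state the distance measure or the DBSCAN parameters, so the word-level Jaccard distance, the values `eps=0.5` and `min_pts=2`, and the function names `jaccard_distance`, `dbscan`, and `pick_representatives` are all assumptions made for illustration.

```python
import random


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over word sets (assumed measure, not from the thesis)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def dbscan(points, eps, min_pts, dist):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual definition.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nb)
    return labels


def pick_representatives(queries, eps=0.5, min_pts=2, seed=0):
    """DBRAN-style selection: one randomly chosen query per DBSCAN cluster."""
    rng = random.Random(seed)
    labels = dbscan(queries, eps, min_pts, jaccard_distance)
    clusters = {}
    for query, label in zip(queries, labels):
        if label != -1:  # noise queries get no representative
            clusters.setdefault(label, []).append(query)
    return [rng.choice(members) for _, members in sorted(clusters.items())]


if __name__ == "__main__":
    queries = [
        "cheap flights to paris",
        "cheap flights paris",
        "cheap flights to paris france",
        "python list sort",
        "python sort list",
        "weather tomorrow",
    ]
    print(pick_representatives(queries))
```

With these assumed settings, the three flight queries form one cluster, the two Python queries form another, and the unrelated query is left as noise, so two representatives are returned. DBSM is not sketched here because the abstract does not specify how queries are merged or how improvement of a similarity table is measured.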

Keywords:
Join (topology), Similarity (geometry), Computer science, Filter (signal processing), Information retrieval, Query optimization, Data mining, Mathematics, Artificial intelligence, Combinatorics
