Companies with limited budgets must decide how best to defend against cyber threats. It is therefore necessary to develop techniques that group these threats into high and low risk categories based on observable attributes. High risk threats would receive a high budget allocation, while low risk threats could simply be ignored. Determining the dividing line between a high and a low risk threat is a fundamental problem in cyber security. Fortunately, data mining techniques such as cluster analysis can be used to automatically assign members to categories. In this study, we examine whether clustering can be used to assign threats to high and low risk categories in the case of copyright infringement complaints, where rightsholders routinely engage lawyers to issue takedown notices under the Digital Millennium Copyright Act (DMCA), at significant expense. Rather than issuing a takedown notice for every site, it would be more cost effective to issue notices only to the highest risk sites. In this paper, we use cluster analysis to group piracy websites into high and low income categories, based on a range of attributes such as income earned per day and estimated worth. We found that the most complained about sites fall into two natural clusters (high income and low income). This means that rightsholders should focus their efforts and resources on high income sites only, and ignore the others. We also found that the key variables separating high risk from low risk piracy websites included daily page views, the number of internal and external links, social media shares (i.e. social network engagement), and elements of the page structure, including HTML page and JavaScript sizes. The broader implications for automating the assignment of risk to cyber security events are discussed.
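To make the grouping step concrete, the following is a minimal sketch of a two-cluster analysis. It is not the method used in the study: the abstract does not name the clustering algorithm, so k-means is assumed here purely for illustration, and the site names, page-view counts, and income figures are invented placeholders rather than data from the paper.

```python
import math

# Hypothetical sites: (name, daily page views, daily income in USD).
# All values are made up for illustration only.
sites = [
    ("site-a", 900_000, 1200.0),
    ("site-b", 750_000, 950.0),
    ("site-c", 1_100_000, 1500.0),
    ("site-d", 12_000, 15.0),
    ("site-e", 8_000, 9.0),
    ("site-f", 20_000, 30.0),
    ("site-g", 5_000, 4.0),
]


def zscore(rows):
    """Standardize each column so no single attribute dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iters=100):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


features = zscore([[views, income] for _, views, income in sites])
labels = kmeans(features, k=2)


def mean_income(cluster):
    vals = [inc for (_, _, inc), l in zip(sites, labels) if l == cluster]
    return sum(vals) / len(vals)


# Call the cluster with the higher mean daily income the "high" risk group.
high = max(set(labels), key=mean_income)
risk = {name: ("high" if l == high else "low") for (name, _, _), l in zip(sites, labels)}

for name, category in risk.items():
    print(name, category)
```

Standardizing the attributes before clustering is a design choice of this sketch, made so that page views (counted in hundreds of thousands) do not swamp income (counted in hundreds) in the distance calculation. In practice, further attributes named in the abstract, such as link counts or page sizes, could be added as extra columns.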
Liqaa Nawaf, Vibhushinie Bentotahewa
Denise Ferebee, Dipankar Dasgupta, Michael Schmidt, Qishi Wu