JOURNAL ARTICLE

Citation‐based plagiarism detection: Practicability on a large‐scale scientific corpus

Béla Gipp, Norman Meuschke, Corinna Breitinger

Year: 2014
Journal: Journal of the Association for Information Science and Technology
Vol: 65 (8)
Pages: 1527-1540
Publisher: Wiley

Abstract

The automated detection of plagiarism is an information retrieval task of increasing importance as the volume of readily accessible information on the web expands. A major shortcoming of current automated plagiarism detection approaches is their dependence on high character‐based similarity. As a result, heavily disguised plagiarism forms, such as paraphrases, translated plagiarism, or structural and idea plagiarism, remain undetected. A recently proposed language‐independent approach to plagiarism detection, Citation‐based Plagiarism Detection (CbPD), allows the detection of semantic similarity even in the absence of text overlap by analyzing the citation placement in a document's full text to determine similarity. This article evaluates the performance of CbPD in detecting plagiarism with various degrees of disguise in a collection of 185,000 biomedical articles. We benchmark CbPD against two character‐based detection approaches using a ground truth approximated in a user study. Our evaluation shows that the citation‐based approach achieves superior ranking performance for heavily disguised plagiarism forms. Additionally, we demonstrate CbPD to be computationally more efficient than character‐based approaches. Finally, upon combining the citation‐based with the traditional character‐based document similarity visualization methods in a hybrid detection prototype, we observe a reduction in the required user effort for document verification.
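To illustrate the idea of comparing documents by citation placement rather than by text overlap, the following is a minimal sketch, not the authors' implementation. It assumes each document has already been reduced to the ordered list of sources it cites, and it scores two documents by the longest common subsequence of those lists. The function names, the normalization by the shorter sequence, and the sample citation keys are all hypothetical choices made for this example; the abstract does not specify the measures CbPD actually uses.

```python
from typing import Sequence


def longest_common_citation_sequence(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two citation sequences."""
    # prev[j] is the LCS length of the part of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for ref_a in a:
        curr = [0]
        for j, ref_b in enumerate(b, start=1):
            if ref_a == ref_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def citation_order_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Shared citation order as a fraction of the shorter sequence (assumed normalization)."""
    if not a or not b:
        return 0.0
    return longest_common_citation_sequence(a, b) / min(len(a), len(b))


# Hypothetical citation keys, listed in the order they appear in each text.
doc_a = ["smith2001", "lee2005", "chen2010", "kim2012"]
doc_b = ["smith2001", "jones1999", "chen2010", "kim2012", "patel2008"]

print(citation_order_similarity(doc_a, doc_b))  # 0.75
```

Because the comparison only looks at which sources are cited and in what order, the score is unaffected by paraphrasing or translation of the surrounding text, which is the property the abstract highlights.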

Keywords:
Plagiarism detection; Computer science; Information retrieval; Citation; Similarity (geometry); Character (mathematics); Ranking (information retrieval); Task (project management); Benchmark (surveying); Natural language processing; Visualization; Semantic similarity; Artificial intelligence; World Wide Web; Image (mathematics)

Metrics

Cited By: 42
FWCI (Field Weighted Citation Impact): 10.16
Refs: 22
Citation Normalized Percentile: 0.98

Topics

Academic integrity and plagiarism
Social Sciences → Social Sciences → Safety Research
Topic Modeling
Physical Sciences → Computer Science → Artificial Intelligence
Authorship Attribution and Profiling
Physical Sciences → Computer Science → Artificial Intelligence