|Title:||Plagiarism Detection: A focus on the Intrinsic Approach and the Evaluation in the Arabic Language|
|Authors:||Bensalem, Imene|
|Affiliations:||Faculty of Information and Communication Technology (ICT)|
|Keywords:||Intrinsic plagiarism detection; Evaluation corpora; Stylistic analysis; Character n-grams; Arabic plagiarism detection|
|Date:||1-Feb-2020|
|Abstract:||
With the advent of the Internet and the widespread use of digital documents, access to information from the four corners of the globe has become ever easier. This has been accompanied by the copy-paste phenomenon, which has reduced the appropriation of others’ work (i.e., plagiarism) to a few clicks. Since the 1970s, researchers have been developing software to detect textual plagiarism automatically. Still, as the techniques of these programs evolve, plagiarists adapt their tactics to evade them. The plagiarism detection tools most likely to remain effective are therefore those able to fight this misconduct in several different ways. Moreover, in the wake of globalisation, these tools should also be able to handle documents in multiple languages. Given the persistence of this problem, acquiring the latest plagiarism detection technologies has become an arms race in a never-ending battle.
This thesis deals with two major topics: plagiarism detection in Arabic documents, and intrinsic plagiarism detection, that is, plagiarism detection based on changes of writing style within the suspicious document. The intrinsic approach is an alternative to the text-matching approach, notably when the source of the plagiarism is unavailable. Our key contributions in these two areas are, first, the development of Arabic corpora that allow plagiarism detection software to be evaluated on this language and, second, the development of a language-independent intrinsic plagiarism detection method that exploits character n-grams in a machine learning approach while avoiding the curse of dimensionality. Our third key contribution is an investigation of which character n-grams, in terms of their frequency and length, are best suited to detecting plagiarism intrinsically. We carried out our experiments on standardised English corpora and on the Arabic corpora we developed, using both our method and one of the most prominent intrinsic plagiarism detection methods. The findings of our analysis can be exploited by future intrinsic plagiarism detection methods that use character n-grams. In addition to these technical contributions, we provide the reader with comprehensive and critical surveys of the literature on Arabic plagiarism detection and on intrinsic plagiarism detection, which were lacking for both topics.
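As a rough illustration of the kind of feature the abstract refers to, the sketch below extracts character n-grams and compares the n-gram profile of a text segment with that of the whole document. It is not the method developed in the thesis: the function names (`char_ngrams`, `ngram_profile`, `profile_distance`) are hypothetical, and the total variation distance is only an assumed stand-in for a style-change measure.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n):
    """Map each character n-gram of `text` to its relative frequency."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def profile_distance(segment, document, n=3):
    """Total variation distance between two n-gram profiles.

    Assumed here as a simple dissimilarity in [0, 1]; a value near 1
    means the segment shares few n-grams with the document.
    """
    p = ngram_profile(segment, n)
    q = ngram_profile(document, n)
    grams = set(p) | set(q)
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in grams) / 2
```

Because the features are raw character sequences, the same code applies unchanged to Arabic or English text, which is one reason character n-grams suit a language-independent approach. In practice a segment would be scored against the rest of the document and flagged if its distance is unusually high; the threshold and the choice of `n` are left open here.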
|Appears in Collections:||Electronic Theses and Dissertation|
checked on Jan 29, 2021
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.