A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection

Introuction Plagiarism: Unauthorize use of Text, coe, iea, Plagiarism etection research area has receive increasing attention The rapi growth of ocuments in ifferent languages Increase accessibility of electronic ocuments 29/1/2017 2

Prototypical Plagiarism Monolingual: Cross-Language: copy or paraphrase inclues translation 29/1/2017 3

Problem efinition has two steps Caniate ocument retrieval D: set of source ocuments : suspicious ocument with fragments f Pairwise ocument similarity : source ocument with fragments f : suspicious ocument with fragments f } ), (,, { ), ( f f f f Sim D D ocuments Caniate } ), (,,, { ), ( f f f f f f Sim pairs Copie 29/1/2017 4

Detaile analysis in a pair of ocuments Possible errors in etecting plagiarism: Text that is not plagiarize might be erroneously reporte Part or whole of plagiarize source or target text might be unreporte Parts of one plagiarism case might be reporte as separate cases 29/1/2017 5

Evaluation Metrics S: set of true plagiarism cases, R: set of etections reporte Precision( R, S) 1 R rr ss ( s r r) Recall( R, S) 1 S ss rr ( s s r) Fraction of reporte etections (at character level) that are truly plagiarize Fraction of plagiarism cases (at character level) that are etecte Granularity(R, S) = 1 S R å sîs R R s Average number of reporte etections per etecte plagiarism case Plaget( R, S) Combine metric log 2 F1 ( R, S) (1 Granularity( R, S)) 29/1/2017 6

Two phase algorithm for ientifying plagiarize text fragments Caniate sentence selection: Fins many possibly plagiarize fragments Focusing on recall Result filtering: Fins alignments between the ientifie passages Focusing on precision 29/1/2017 7

Step 1: Caniate Sentence Selection 29/1/2017 8

Token Extraction Source () Obtain fragments f f Obtain fragments Suspicious ( ) Seeing: Token extraction, Each fragment is create from a sequence of k Using all wors or keywors consecutive sentences using a sliing winow Representative wors K 1 K n K 1 K n 29/1/2017 9

Create vector Create vector Token Extraction Check existence of items in fragments Check existence of items in fragments Source () Obtain fragments f f Obtain fragments Suspicious ( ) Match merging: Two etecte fragments are Ientify Use Cosine presence Similarity of representative terms merge to report a single plagiarism case if the number of characters between those fragments in the source an suspicious ocuments are both below a proximity threshol Representative wors K 1 K n K 1 K n sim( f, K 1 ) sim( f, K 1 ) Similarity computation 29/1/2017 sim( sim( 10 f, K n ) f, K n )

Create vector Create vector Token Extraction Check existence of items in fragments Check existence of items in fragments Source () Obtain fragments f f Obtain fragments Suspicious ( ) Cross-lingual plagiarism etection Representative wors K 1 K n Translate K 1 K n sim( f, K 1 ) sim( f, K 1 ) Similarity computation 29/1/2017 sim( sim( 11 f, K n ) f, K n )

Step 2: Result Filtering 29/1/2017 12

Aligning segments within fragment pairs Fragment pair from the first step retrieve Fragments split into smaller segments Segments aligne using a ynamic programming algorithm allowing 1:0, 0:1, 1:1, 2:1, 1:2, 3:1 an 1:3 alignments exclue sentences at start or en of fragment with >50% content in 1:0 or 0:1 alignments f 29/1/2017 13 f

Alignment etails where S(i, j) represents the score of the optimal alignment from the beginning of the fragment to the i th suspicious segment an the j th source segment To penalize 1-0 an 0-1 alignments an also to make all scores comparable, we keep track of the number of alignments obtaine so far, an the score in each step is normalize by the number of alignments 29/1/2017 14

Granularity level of alignment Sentence level: Using sentences as the granularity level of alignment n-gram level: A plagiarize fragment may omit pieces from the source, but it is likely that at least some of the smallest units are preserve n is the expecte number of terms in each segment 29/1/2017 15

Results Result of etaile analysis sub-task using PersinaPlaget2016 training corpus t = Similarity threshol, n=number of sentences Precision Recall Granularity Plaget (t = 02, n = 5) 04004 08151 1 05370 (t = 03, n = 5) 07630 07486 1 07557 (t = 04, n = 5) 08532 05357 1 06582 (t = 03, n = 3) 07867 08304 1 08080 (t = 04, n = 3) 08604 06567 1 07449 Result of etaile analysis sub-task using PersinaPlaget2016 test corpus Precision Recall Granularity Plaget Runtime (t = 03, n = 5) 07496 07050 1 07266 00:24:08 29/1/2017 16

Results Evaluation of the secon phase, result filtering step: t = 03, n = 3 Precision Recall Granularity Plaget Without result filtering 06029 08602 1 07087 After result filtering 07867 08304 1 08080 29/1/2017 17

Results Evaluation of the seeing phase, using keywors: Precision Recall Granularity Plaget (t = 03, n = 3) 05118 08858 1 06487 (t = 04, n = 3) 06431 08928 1 07476 (t = 05, n = 3) 07475 08862 1 08110 (t = 06, n = 3) 08117 08459 1 08282 (t = 07, n = 3) 08522 07531 1 07996 29/1/2017 18

Cross-lingual etaile analysis for plagiarism etection Ehsan, N, Tompa, FW, Shakery, A: Using a ictionary an n-gram alignment to improve fine-graine cross-language plagiarism etection In: Proceeings of the 2016 ACM Symposium on Document Engineering pp 59 68 ACM (2016) Precision Recall Granularity Plaget Using PAN2012 English-German ataset 09301 08193 1 08712 29/1/2017 19

Summary The propose metho is a two phase approach for ientifying plagiarize fragments The first phase tries to fin possibly plagiarize fragments The secon phase tries to improve the precision metric The framework is applicable in any language The approach coul be aapte for cross language omain 29/1/2017 20

Thanks for your attention 29/1/2017 21