A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection


Introduction
Plagiarism: unauthorized use of text, code, ideas, ...
Plagiarism detection research has received increasing attention:
  The rapid growth of documents in different languages
  Increased accessibility of electronic documents
29/1/2017

Prototypical Plagiarism
Monolingual: copy or paraphrase
Cross-language: includes translation

Problem definition has two steps
Candidate document retrieval
  D: set of source documents
  d: suspicious document with fragments f
  Candidate documents: the documents in D related to the suspicious document through the fragment similarity Sim(f, f)
Pairwise document similarity
  Source document with fragments f
  Suspicious document with fragments f
  Copied pairs: the pairs of fragments (f, f), one from each document, selected through Sim(f, f)

Detailed analysis in a pair of documents
Possible errors in detecting plagiarism:
  Text that is not plagiarized might be erroneously reported
  Part or whole of a plagiarized source or target text might be unreported
  Parts of one plagiarism case might be reported as separate cases

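Evaluation Metrics
S: set of true plagiarism cases, R: set of detections reported
Precision: fraction of reported detections (at character level) that are truly plagiarized
  $$\mathrm{Precision}(R, S) = \frac{1}{|R|} \sum_{r \in R} \frac{\left|\bigcup_{s \in S} (s \sqcap r)\right|}{|r|}$$
Recall: fraction of plagiarism cases (at character level) that are detected
  $$\mathrm{Recall}(R, S) = \frac{1}{|S|} \sum_{s \in S} \frac{\left|\bigcup_{r \in R} (s \sqcap r)\right|}{|s|}$$
Granularity: average number of reported detections per detected plagiarism case
  $$\mathrm{Granularity}(R, S) = \frac{1}{|S_R|} \sum_{s \in S_R} |R_s|$$
  where S_R is the set of cases in S that are detected and R_s is the set of detections in R that report case s
Plagdet: combined metric
  $$\mathrm{Plagdet}(R, S) = \frac{F_1(R, S)}{\log_2\left(1 + \mathrm{Granularity}(R, S)\right)}$$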
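The four metrics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation script: it assumes each case and each detection is represented as a set of character offsets in a single document (the real measures consider source and suspicious spans jointly), and the function name `plagdet_scores` is made up for this example.

```python
from math import log2

def plagdet_scores(cases, detections):
    """cases, detections: lists of sets of character offsets (a simplifying assumption)."""
    def covered(span, others):
        # Fraction of the characters of `span` covered by the union of `others`.
        union = set().union(*others)
        return len(span & union) / len(span) if span else 0.0

    precision = sum(covered(r, cases) for r in detections) / len(detections) if detections else 0.0
    recall = sum(covered(s, detections) for s in cases) / len(cases) if cases else 0.0

    # Granularity: average number of detections per detected case.
    per_case = [sum(1 for r in detections if r & s) for s in cases]
    detected = [n for n in per_case if n > 0]
    granularity = sum(detected) / len(detected) if detected else 1.0

    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, granularity, f1 / log2(1 + granularity)
```

For example, one true case of 100 characters reported as two separate detections keeps precision at 1 but is penalized through granularity 2, which is exactly the "parts of one case reported as separate cases" error.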

Two-phase algorithm for identifying plagiarized text fragments
Candidate sentence selection: finds many possibly plagiarized fragments, focusing on recall
Result filtering: finds alignments between the identified passages, focusing on precision

Step 1: Candidate Sentence Selection

Token Extraction
Seeding: token extraction, using all words or keywords
Each fragment is created from a sequence of k consecutive sentences using a sliding window
[Diagram: source document and suspicious document -> obtain fragments f; token extraction -> representative words K_1 ... K_n for each document]
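The fragment construction described on this slide might look like the sketch below. The sentence splitter is a naive stand-in, and the one-sentence step of the window is an assumption; the slide states only that each fragment consists of k consecutive sentences.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def sliding_fragments(sentences, k):
    # Each fragment is k consecutive sentences; the window advances one sentence at a time.
    if len(sentences) <= k:
        return [sentences[:]] if sentences else []
    return [sentences[i:i + k] for i in range(len(sentences) - k + 1)]
```

A document with fewer than k sentences yields a single, shorter fragment here; that is a design choice of this sketch, not something the slides specify.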

Similarity computation
Create a vector for each fragment by checking the existence of items in the fragment (identify the presence of representative terms)
Use cosine similarity
Match merging: two detected fragments are merged to report a single plagiarism case if the numbers of characters between those fragments in the source and in the suspicious document are both below a proximity threshold
[Diagram: source document and suspicious document -> obtain fragments f; token extraction -> representative words K_1 ... K_n; create vectors; similarity computation sim(f, K_1) ... sim(f, K_n)]
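A rough sketch of the two ideas on this slide, presence vectors compared with cosine similarity and proximity-based match merging, is given below. The exact vector construction is not recoverable from the transcript, so the binary vectors over a shared list of representative words are an assumption, as are the function names, the tuple layout of a match, and the greedy single-pass merge.

```python
from math import sqrt

def presence_vector(fragment_tokens, representative_words):
    # 1 if the representative word occurs in the fragment, else 0.
    tokens = set(fragment_tokens)
    return [1 if w in tokens else 0 for w in representative_words]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def gap(start1, end1, start2, end2):
    # Characters between two spans; negative when they overlap.
    return max(start1, start2) - min(end1, end2)

def merge_matches(matches, max_gap):
    """matches: (src_start, src_end, susp_start, susp_end) character spans."""
    merged = []
    for m in sorted(matches):
        if merged:
            a = merged[-1]
            # Merge only if the gap is below the threshold in BOTH documents.
            if gap(a[0], a[1], m[0], m[1]) < max_gap and gap(a[2], a[3], m[2], m[3]) < max_gap:
                merged[-1] = (min(a[0], m[0]), max(a[1], m[1]),
                              min(a[2], m[2]), max(a[3], m[3]))
                continue
        merged.append(m)
    return merged
```

Merging in this way is what keeps granularity close to 1: neighbouring detections that belong to one case are reported once instead of several times.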

Cross-lingual plagiarism detection
[Diagram: source document and suspicious document -> obtain fragments f; token extraction -> representative words K_1 ... K_n, which are translated; create vectors by checking the existence of items in fragments; similarity computation sim(f, K_1) ... sim(f, K_n)]

Step 2: Result Filtering

Aligning segments within fragment pairs
Fragment pairs are retrieved from the first step
Fragments are split into smaller segments
Segments are aligned using a dynamic programming algorithm allowing 1:0, 0:1, 1:1, 2:1, 1:2, 3:1 and 1:3 alignments
Sentences at the start or end of a fragment with more than 50% of their content in 1:0 or 0:1 alignments are excluded

Alignment details
S(i, j) represents the score of the optimal alignment from the beginning of the fragment to the i-th suspicious segment and the j-th source segment
To penalize 1:0 and 0:1 alignments, and also to make all scores comparable, we keep track of the number of alignments obtained so far, and the score in each step is normalized by the number of alignments
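The recurrence itself did not survive in the transcript, so the following is only one plausible reading of the description: a dynamic program over the seven allowed alignment types in which every cell stores a running total and the number of alignments used, and candidates are compared by their average. The Jaccard segment similarity, the zero score for 1:0 and 0:1 alignments, and all names are assumptions. Because averages are compared cell by cell, this sketch is a heuristic and is not guaranteed to find the globally best normalized score.

```python
# Allowed (suspicious, source) segment counts per alignment step.
MOVES = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2), (3, 1), (1, 3)]

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def align(susp, src, sim=jaccard):
    """susp, src: lists of token lists, one per segment. Returns (score, alignment)."""
    n, m = len(susp), len(src)
    # best[i][j] = (total score, number of alignments, move used to get here)
    best = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, 0, None)
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in MOVES:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or best[pi][pj] is None:
                    continue
                total, count, _ = best[pi][pj]
                if di and dj:
                    left = [t for seg in susp[pi:i] for t in seg]
                    right = [t for seg in src[pj:j] for t in seg]
                    total += sim(left, right)
                # 1:0 and 0:1 add no score but still count, which lowers the average.
                cand = (total, count + 1, (di, dj))
                cur = best[i][j]
                if cur is None or cand[0] / cand[1] > cur[0] / cur[1]:
                    best[i][j] = cand
    # Follow the stored moves back from (n, m) to (0, 0).
    alignment, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = best[i][j][2]
        alignment.append(((i - di, i), (j - dj, j)))
        i, j = i - di, j - dj
    total, count, _ = best[n][m]
    return (total / count if count else 0.0), alignment[::-1]
```

Each alignment entry is a pair of half-open index ranges, so `((0, 2), (0, 1))` means suspicious segments 0 and 1 were aligned with source segment 0 (a 2:1 alignment).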

Granularity level of alignment
Sentence level: sentences are used as the granularity level of alignment
n-gram level: a plagiarized fragment may omit pieces from the source, but it is likely that at least some of the smallest units are preserved
  n is the expected number of terms in each segment

Results
Detailed analysis sub-task on the PersianPlagdet2016 training corpus (t = similarity threshold, n = number of sentences)
  Setting             Precision  Recall  Granularity  Plagdet
  (t = 0.2, n = 5)    0.4004     0.8151  1            0.5370
  (t = 0.3, n = 5)    0.7630     0.7486  1            0.7557
  (t = 0.4, n = 5)    0.8532     0.5357  1            0.6582
  (t = 0.3, n = 3)    0.7867     0.8304  1            0.8080
  (t = 0.4, n = 3)    0.8604     0.6567  1            0.7449
Detailed analysis sub-task on the PersianPlagdet2016 test corpus
  Setting             Precision  Recall  Granularity  Plagdet  Runtime
  (t = 0.3, n = 5)    0.7496     0.7050  1            0.7266   00:24:08
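As a quick consistency check on these figures, the row for t = 0.3, n = 3 can be recomputed by hand, taking F_1 to be the harmonic mean of precision and recall. With granularity 1 the logarithm in the denominator equals 1, so Plagdet reduces to F_1:

```latex
F_1 = \frac{2 \cdot 0.7867 \cdot 0.8304}{0.7867 + 0.8304}
    = \frac{1.3066}{1.6171} \approx 0.8080,
\qquad
\mathrm{Plagdet} = \frac{F_1}{\log_2(1 + 1)} = F_1 \approx 0.8080
```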

Results
Evaluation of the second phase, the result filtering step (t = 0.3, n = 3)
  Setting                    Precision  Recall  Granularity  Plagdet
  Without result filtering   0.6029     0.8602  1            0.7087
  After result filtering     0.7867     0.8304  1            0.8080

Results
Evaluation of the seeding phase, using keywords
  Setting             Precision  Recall  Granularity  Plagdet
  (t = 0.3, n = 3)    0.5118     0.8858  1            0.6487
  (t = 0.4, n = 3)    0.6431     0.8928  1            0.7476
  (t = 0.5, n = 3)    0.7475     0.8862  1            0.8110
  (t = 0.6, n = 3)    0.8117     0.8459  1            0.8282
  (t = 0.7, n = 3)    0.8522     0.7531  1            0.7996

Cross-lingual detailed analysis for plagiarism detection
Ehsan, N., Tompa, F.W., Shakery, A.: Using a dictionary and n-gram alignment to improve fine-grained cross-language plagiarism detection. In: Proceedings of the 2016 ACM Symposium on Document Engineering, pp. 59-68. ACM (2016)
Using the PAN2012 English-German dataset
  Precision  Recall  Granularity  Plagdet
  0.9301     0.8193  1            0.8712

Summary
The proposed method is a two-phase approach for identifying plagiarized fragments
  The first phase tries to find possibly plagiarized fragments
  The second phase tries to improve the precision metric
The framework is applicable in any language
The approach could be adapted for the cross-language domain

Thanks for your attention