Collocation Extraction Using Square Mutual Information Approaches. Received December 2010; revised January 2011


International Journal of Knowledge and Language Processing, www.ijklp.org, KLP International ©2011, ISSN 2191-2734, Volume 2, Number 1, January 2011, pp. 53-58

Collocation Extraction Using Square Mutual Information Approaches

Huarui Zhang 1, Yongwei Zhang 2 and Jingsong Yu 3

1 Institute of Computational Linguistics, Peking University, Beijing, China, hrzhang@pku.edu.cn
2,3 School of Software and Microelectronics, Peking University, Beijing, China, 2 zhangywbb@gmail.com, 3 yjs@ss.pku.edu.cn

Received December 2010; revised January 2011

ABSTRACT. Mutual Information (MI) was proposed long ago as a measure of collocation. Although it is still widely applied in various fields, it has the disadvantage of heavily favoring rarely occurring items. A new improved Square Mutual Information approach is proposed to solve this problem. Experimental results show that the precision of the new method is better than that of MI and of another modified approach that combines external and internal measures. Another advantage of the new approach is that it remains language independent.

Keywords: Collocation, association measure, square mutual information, improved square mutual information

1. Introduction. The statistical approach to collocation extraction has been a dominant trend for years, from [4, 9, 6] to [5, 7, 1]. Mutual Information (MI) is one of the earliest and most widely used measures, referred to by the majority of research papers on collocation extraction. In [8], a total of 82 association measures are empirically tested, 6 of which are mutual information and derived measures. However, the new approach proposed in this paper is not found in the full list.

Our main interest lies in the improvement of measures related to mutual information. One intuitive motivation is that mutual information originates from information theory, and many information-theoretic approaches have been quite successful in NLP. Another motivation, from the opposite direction, is that mutual information is sometimes considered a poor measure for collocation extraction. Despite its disadvantage of heavily favoring rarely occurring items, we think that MI can be improved to get better performance. We will first review one such attempt to modify MI [2, 3].
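The bias toward rare items noted above can be made concrete with a small sketch. It assumes the standard pointwise form MI(x, y) = log2( p(xy) / (p(x) p(y)) ) with probabilities estimated as relative frequencies; the function name, the corpus size and all counts are hypothetical and chosen only for illustration.

```python
import math


def pointwise_mi(f_xy, f_x, f_y, n):
    """Pointwise MI from raw counts.

    f_xy, f_x, f_y are frequencies of the pair and of its two parts;
    n is the (hypothetical) corpus size used to turn counts into
    relative-frequency estimates of the probabilities.
    """
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))


CORPUS_SIZE = 1_000_000  # invented value

# A pair seen only once, whose parts also occur only once.
rare = pointwise_mi(1, 1, 1, CORPUS_SIZE)

# A pair seen 500 times, whose parts occur 1000 times each.
frequent = pointwise_mi(500, 1000, 1000, CORPUS_SIZE)

print(f"rare pair:     {rare:.2f}")
print(f"frequent pair: {frequent:.2f}")
```

Under these invented counts the single-occurrence pair receives the higher score, although one co-occurrence is weak evidence of a collocation. This is the behavior the modified measures try to correct.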

2. Unithood: Chen's approach. Chen [2, 3] calculates a unithood measure by combining an external measure and an internal measure. The external measure is based on two rates, the left dependent rate (LD) and the right dependent rate (RD):

$$\mathrm{LD}(w_1 \ldots w_n) = \frac{\max_{a \in A} f(a\, w_1 \ldots w_n)}{f(w_1 \ldots w_n)}$$

$$\mathrm{RD}(w_1 \ldots w_n) = \frac{\max_{b \in B} f(w_1 \ldots w_n\, b)}{f(w_1 \ldots w_n)}$$

where $w = w_1 w_2 \ldots w_n$, $f(w)$ is the frequency of a string $w$, $A$ is the full set of all left neighbor elements of $w$, $a$ is any element of $A$, $B$ is the full set of all right neighbor elements of $w$, and $b$ is any element of $B$. The external measure, denoted IDR (independent rate), is given by

$$\mathrm{IDR}(w_1 .. w_n) = \left(1 - \frac{1}{f(w_1 .. w_n)}\right) \left(1 - \mathrm{LD}(w_1 .. w_n)\right) \left(1 - \mathrm{RD}(w_1 .. w_n)\right) \tag{3}$$

The internal measure is based on $\mathrm{ConnectRate}(w_i w_{i+1})$, which is given by

$$\mathrm{ConnectRate}(w_i w_{i+1}) = \frac{p(w_i w_{i+1})}{p(w_i w_{i+1}) + p(w_i)\, p(w_{i+1})}$$

The minimum of $\mathrm{ConnectRate}(w_i w_{i+1})$, denoted $\mathrm{MinConnectRate}(w_1 .. w_n)$, is the internal measure:

$$\mathrm{MinConnectRate}(w_1 .. w_n) = \min_{1 \le i \le n-1} \mathrm{ConnectRate}(w_i w_{i+1})$$

The final formula of the unithood measure, denoted $\mathrm{UnitRate}(w_1 .. w_n)$, is the product of the external measure $\mathrm{IDR}(w_1 .. w_n)$ and the internal measure $\mathrm{MinConnectRate}(w_1 .. w_n)$:

$$\mathrm{UnitRate}(w_1 .. w_n) = \mathrm{IDR}(w_1 .. w_n) \times \mathrm{MinConnectRate}(w_1 .. w_n)$$

It can be seen that $\mathrm{ConnectRate}(w_i w_{i+1})$ is a transformation of MI and can be derived from MI directly. This suggests that Chen's approach also belongs to the MI family, so we will compare its results with those of our new method.

3. Improved square mutual information: New approach. We add a new term to square MI that increases the influence of high-frequency combinations on a logarithmic scale. The bigram version is given by

$$\mathrm{SquareMI}(x, y) = \log \frac{f(xy)^2 \, \log\left(1 + f(xy)\right)}{f(x)\, f(y)}$$

where $x$ and $y$ are the adjacent parts of the combination $xy$, $f(x)$ and $f(y)$ are the frequencies of the parts $x$ and $y$, and $f(xy)$ is the frequency of the combination $xy$. The n-gram version is

$$\mathrm{SquareMI}(w_1, \ldots, w_n) = \log \frac{f(w_1 \ldots w_n)^n \, \log\left(1 + f(w_1 \ldots w_n)\right)}{\prod_{i=1}^{n} f(w_i)}$$

where $w = w_1 w_2 \ldots w_n$, $f(w_i)$ is the frequency of part $w_i$, and $f(w_1 \ldots w_n)$ is the frequency of the combination $w$.

4. Results and Discussion. The evaluation data consists of two parts. The first part is the People's Daily Corpus (January 1998), segmented and annotated by the Institute of Computational Linguistics, Peking University. The second part is the Financial Times (http://www.ftchinese.com/), mainly Chinese text translated from original English text.

The evaluation is based on the following assumption: the connection between collocations and words is similar to that between words and Chinese characters. If a method is suitable for extracting words from Chinese character combinations, then it is suitable for extracting collocations from word combinations.

TABLE 1. Comparison of precisions

| Number of collocations | Mutual Information (%) | Unit Rate (%) | Square MI (%) |
|---|---|---|---|
| Top 100 | 68.00 | 86.00 | 95.00 |
| Top 500 | 69.60 | 87.58 | 88.18 |
| Top 1000 | 66.70 | 81.60 | 87.20 |
| Top 5000 | 63.02 | 67.34 | 76.10 |
| Top 10000 | 58.46 | 58.75 | 64.75 |
| Top 15000 | 53.29 | 53.55 | 57.32 |
| Top 21296 | 47.92 | 49.15 | 50.26 |

The top 21296 terms are selected for evaluation, in parallel with Chen's approach (denoted as UnitRate hereafter) for better comparability, as shown in Table 1. The precision changes with the number of collocations selected. In Figures 1, 2, and 3, the horizontal axis is the number of collocations (in units of 100), while the vertical axis is precision. From Figure 1 we can see that our improved square mutual information approach is better than both Chen's method and the pointwise mutual information method.

FIGURE 1. Comparison with MI and UnitRate.

In [2], Chen's method achieved higher precision than we obtained by repeating his method. One conjecture is that preprocessing and/or postprocessing was done before or after the extraction. After we remove the extracted words that contain Chinese characters in the stop list, the precision curve becomes that of Figure 2.

FIGURE 2. Comparison with UnitRate after filtering.

From Figure 2 we can see that after the removal of words containing Chinese characters in the stop list, Chen's method gets much closer to our improved square mutual information method. Figure 3 shows the change in the precision curve of our improved square mutual information method before and after the removal of words containing stop-list Chinese characters. The minor change in the precision curve of our method suggests that it performs better even before filtering is used, which means that our method is more effective and can be language independent.
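The scoring of Section 3 and the top-N precision used in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the bigram score has the form log( f(xy)^2 · log(1 + f(xy)) / (f(x) f(y)) ), which is one reading of the formula in Section 3, and it uses natural logarithms because the base is not stated (a change of base shifts or rescales every score equally and leaves the ranking unchanged). All function names, the toy token sequence and the reference set are hypothetical.

```python
import math
from collections import Counter


def improved_square_mi(f_xy, f_x, f_y):
    """Assumed bigram score: log( f(xy)^2 * log(1 + f(xy)) / (f(x) * f(y)) ).

    Requires f_xy >= 1 so that the inner logarithm is positive.
    """
    return math.log(f_xy ** 2 * math.log(1 + f_xy) / (f_x * f_y))


def rank_bigrams(tokens):
    """Score every adjacent pair in a token sequence, best first."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {
        pair: improved_square_mi(f, unigrams[pair[0]], unigrams[pair[1]])
        for pair, f in bigrams.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def precision_at(ranked, reference, n):
    """Share of the top-n candidates that appear in a reference set."""
    top = ranked[:n]
    return sum(1 for candidate in top if candidate in reference) / len(top)


# Toy data, invented for illustration only.
tokens = "new york is big and new york is old and big is old".split()
reference = {("new", "york")}

ranked = rank_bigrams(tokens)
print(ranked[:3])
print(precision_at(ranked, reference, 1))
```

On real data the candidates would be character or word combinations from a segmented corpus, and the reference set would come from the corpus annotation or from expert judgment, as in Tables 1 and 2.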

FIGURE 3. Improved Square MI (before and after filtering).

Expert Evaluation: A randomly chosen sample of the result is manually checked by human experts, and the approved percentage is shown in Table 2.

TABLE 2. Comparison of expert evaluation

| Number of collocations | Unit Rate (%) | Square MI (%) |
|---|---|---|
| Top 100 | 82 | 84 |
| Top 500 | 72 | 78 |
| Top 1000 | 58 | 63 |
| Top 3000 | 53 | 56 |
| Top 5000 | 40 | 43 |
| Top 10000 | 38 | 38 |

From these comparisons, we find that our improved square mutual information approach obtains better precision in collocation extraction.

5. Conclusions. The new improved square mutual information approach consistently outperforms the pointwise mutual information method. Although simpler than Chen's approach, our approach is still more effective than Chen's when no filter is applied. Human evaluation on a chosen sample also confirms the advantage of this new approach.

Acknowledgment. This work is partially based on the segmented and annotated Chinese corpus developed by the Institute of Computational Linguistics at Peking University under the leadership of Professor Shiwen YU.

REFERENCES

[1] I. A. Bolshakov, E. I. Bolshakova, A. P. Kotlyarov and A. Gelbukh, Various Criteria of Collocation Cohesion in Internet: Comparison of Resolving Power, Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, vol.4919, pp.64-72, 2010.
[2] Chen Yirong, The Research on Automatic Chinese Term Extraction Integrated with Unithood and Domain Feature, Master Thesis in Peking University, Beijing, 2005.
[3] Yirong Chen, Qin Lu, Wenjie Li, Zhifang Sui and Luning Ji, A Study on Terminology Extraction Based on Classified Corpora, Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), pp.2383-2386, 2006.
[4] K. Church and P. Hanks, Word association norms, mutual information and lexicography, Computational Linguistics, vol.16, no.1, pp.22-29, 1990.
[5] S. Evert, The Statistics of Word Cooccurrences: Word Pairs and Collocations, PhD dissertation, IMS, University of Stuttgart, 2004.
[6] C. Manning and H. Schutze, Foundations of statistical natural language processing, MIT Press, Cambridge, MA, 1999.
[7] B. T. McInnes, Extending the Log Likelihood Measure to Improve Collocation Identification, M.S. Thesis, Department of Computer Science, University of Minnesota, Duluth, 2004.
[8] P. Pecina, Lexical association measures and collocation extraction, Lang Resources & Evaluation, vol.44, pp.137-158, 2010.
[9] J. Pustejovsky, P. Anick, and S. Bergler, Lexical semantic techniques for corpus analysis, Computational Linguistics, vol.19, no.2, pp.331-358, 1993.