Collocation Map for Overcoming Data Sparseness

Moonjoo Kim, Young S. Han, and Key-Sun Choi
Department of Computer Science
Korea Advanced Institute of Science and Technology
Taejon, 305-701, Korea
mj0712@eve.kaist.ac.kr, yshan@csking.kaist.ac.kr, kschoi@csking.kaist.ac.kr

Abstract

Statistical language models are useful because they provide probabilistic information for uncertain decision making. The most common statistic is the n-gram, which measures word cooccurrences in texts. The method, however, suffers from the data shortage problem. In this paper, we suggest that Bayesian networks be used to approximate, with graceful degradation, the statistics of cooccurrences that are insufficient or absent in the sample texts. The Collocation map is a sigmoid belief network that can be constructed from bigrams. We compared the conditional probabilities and mutual information computed from bigrams and from the Collocation map. The results show that for infrequent pairs the variance of the values from the Collocation map is smaller than that of the frequency measure by 48%. The predictive power of the Collocation map for arbitrary associations not observed in the sample texts is also demonstrated.

1 Introduction

In statistical language processing, n-grams are basic to many probabilistic models, including Hidden Markov models, which work on the limited dependency of linguistic events. In this regard, Bayesian models (the Bayesian network, belief network, and inference diagram, to name a few) are not very different from HMMs. Bayesian models capture the conditional independence among probabilistic variables and can compute the conditional distribution of the variables, which is known as probabilistic inferencing. The pure n-gram statistic, however, is somewhat crude in that it can do nothing about unobserved events, and its approximation of infrequent events can be unreliable. In this paper we show, by way of extensive experiments, that a Bayesian method that can likewise be composed from bigrams overcomes the data sparseness problem inherent in frequency counting methods. According to the empirical results, the Collocation map, a Bayesian model over lexical variables, induced graceful approximation of unobserved and infrequent events.

There are two known kinds of methods for dealing with the data sparseness problem: smoothing and class-based methods (Dagan 1992). Smoothing methods (Church and Gale 1991) readjust the distribution of word-occurrence frequencies obtained from sample texts and verify the distribution on held-out texts. As Dagan (1992) pointed out, however, the values from smoothing methods closely agree with the probability of a bigram consisting of two independent words. Class-based methods (Pereira et al. 1993) approximate the likelihood of unobserved words on the basis of similar words; Dagan et al. (1992) proposed a non-hierarchical class-based method. The two approaches report limited successes of a purely experimental nature because they are based on strong assumptions. In the case of smoothing methods, the frequency readjustment is somewhat arbitrary and will not be good for heavily dependent bigrams. As for class-based methods, the notion of similar words differs across methods, and the association of probabilistic dependency with the similarity (class) of words is too strong an assumption in general.

The Collocation map, first suggested in (Han 1993), is a sigmoid belief network with words as probabilistic variables. The sigmoid belief network was studied extensively by Neal (1992) and has an efficient inferencing algorithm. Unlike inference on other Bayesian models, inference on a sigmoid belief network is not NP-hard; inference methods based on reducing the network and on sampling are discussed in (Han 1995).

Bayesian models constructed from local dependencies provide a formal approximation among the variables, so using the Collocation map does not require strong assumptions or intuition to justify the associations among words that the map produces. Inferencing on the Collocation map yields probabilities among any combination of the words represented in the map, which is not found in other models.

One significant shortcoming of Bayesian models lies in the heavy cost of inferencing. Our implementation of the Collocation map includes 988 nodes and takes two to three minutes to compute an association between words. The purpose of the experiments is to find out how gracefully the Collocation map deals with unobserved cooccurrences in comparison with a naive bigram statistic. In the next section, the Collocation map is reviewed following the definition in (Han 1993). In section 3, mutual information and conditional probabilities computed using bigrams and the Collocation map are compared. Section 4 concludes the paper by summarizing the strong and weak points of the Collocation map and the other methods.

2 Collocation Map

In this section we give a brief introduction to the Collocation map, referring to (Han 1993) for more discussion of the definition and to (Han 1995) for inference methods. A Bayesian model consists of a network and probability tables defined on the nodes of the network. The nodes of the network represent probabilistic variables of a problem domain, and the network can compute the probabilistic dependency between any combination of the variables. The model is well documented as subjective probability theory (Pearl 1988). The Collocation map is an application of the sigmoid belief network (Neal 1992), which belongs to the belief networks, themselves a type of Bayesian model. Unlike general belief networks, the Collocation map has no deterministic variables; it consists only of probabilistic variables, which here correspond to words. A sigmoid belief network differs from other belief networks in that it keeps no probability table at each node, only weights on the edges between nodes. A node takes binary outcomes (1, -1), and the probability that a node takes an outcome, given the vector of outcomes of its preceding nodes, is a sigmoid function of those outcomes and the weights of the associated edges. In this regard the sigmoid belief network resembles an artificial neural network. In ordinary Bayesian models such probabilities are stored at the nodes, which makes inferencing very difficult because the probability tables can be very big. The sigmoid belief network does away with the NP-hard complexity by avoiding the tables, at the loss of the expressive generality of the probability distributions that tables can encode.

One who works with the Collocation map has to deal with two problems: how to construct the network, and how to compute probabilities on it. The network can be constructed directly from a set of bigrams obtained from a training sample. Because the Collocation map is a directed acyclic graph, cycles are avoided by creating an additional node for a word whenever its node would otherwise complete a cycle; no more than two nodes per word are needed in any case (Han 1993). Once the network is set up, each edge is assigned a weight, namely the frequency of the edge normalized at its node. Inferencing on the Collocation map is no different from that on a sigmoid belief network; the time complexity of inferencing by graph reduction is polynomial in the number of nodes N (Han 1995).

P( profit | investment ) = 0.644069
P( risk-taking | investment ) = 0.549834
P( stock | investment ) = 0.546001
P( high-income | investment ) = 0.564798
P( investment | high-income ) = 0.500000
P( high-income | risk-taking, profit ) = 0.720300
P( investment | portfolio, high-income, risk-taking ) = 0.495988
P( portfolio | blue-chip ) = 0.500000
P( portfolio, stock | portfolio, stock ) = 1.000000

Figure 1: Example Collocation map and example inferences. The graph reduction method (Han 1995) is used in computing the probabilities.
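To make this concrete: writing x_j ∈ {1, -1} for the outcomes of a node's predecessors and w_j for the weights of the corresponding edges, the node takes outcome 1 with probability σ(Σ_j w_j x_j), where σ(z) = 1 / (1 + e^-z). The following is a minimal Python sketch of the construction under that reading; the class, the apostrophe convention for duplicate nodes, and the toy bigram counts are illustrative assumptions, not the authors' C implementation.

import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class CollocationMapSketch:
    """Directed acyclic word network with bigram-derived edge weights."""

    def __init__(self):
        self.children = defaultdict(set)  # node -> child nodes
        self.count = {}                   # (parent, child) -> bigram count

    def _reaches(self, start, goal):
        # Depth-first search over the existing edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.children[node])
        return False

    def add_bigram(self, w1, w2, freq):
        # If the edge w1 -> w2 would close a cycle, direct it to a
        # duplicate node "w2'" instead; the paper notes that no more
        # than two nodes per word are ever needed.
        if self._reaches(w2, w1):
            w2 += "'"
        self.children[w1].add(w2)
        self.count[(w1, w2)] = self.count.get((w1, w2), 0) + freq

    def weight(self, parent, child):
        # Edge weight: bigram frequency normalized at the child node.
        total = sum(f for (p, c), f in self.count.items() if c == child)
        return self.count[(parent, child)] / total

    def p_true(self, node, outcomes):
        # P(node = 1 | parent outcomes), outcomes mapping parent -> +1/-1.
        z = sum(self.weight(p, node) * x for p, x in outcomes.items())
        return sigmoid(z)

cmap = CollocationMapSketch()
cmap.add_bigram("investment", "profit", 29)   # toy counts
cmap.add_bigram("risk-taking", "profit", 16)
print(cmap.p_true("profit", {"investment": 1, "risk-taking": -1}))

Normalizing the counts at each node keeps every weight in [0, 1], so a node's probability of taking outcome 1 rises smoothly with the number of its positively active collocates.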
It turned out that inferencing on networks containing more than a few hundred nodes was not practical using either the node reduction method or the sampling method alone, so we adopted a hybrid inferencing method that first reduces the network and then applies Gibbs sampling (Han 1995). Using the hybrid method, computation of conditional probabilities took less than a second for a network with 50 nodes, two seconds for a network with 100 nodes, about nine seconds for a network with 200 nodes, and about two minutes for a network with about 1000 nodes. Conditional and marginal probabilities can be approximated from the Gibbs samples. Some conditional probabilities computed from a small network are shown in figure 1. Though the network may not be big enough to model the domain of finance, the resulting values from this small network composed of 9 dependencies seem useful and intuitive.
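Since the paper does not spell out the hybrid procedure, the sketch below shows only the generic Gibbs sampling half for a network like the one above, not the authors' method: every unclamped node is repeatedly resampled from its full conditional, which is proportional to its own sigmoid probability times the likelihood of its children, and the relative frequency of target = 1 after burn-in estimates the conditional probability. The helper functions and the flat 0.5 prior at parentless nodes are assumptions of this sketch.

import random

def all_nodes(cmap):
    nodes = set()
    for parent, child in cmap.count:
        nodes.update((parent, child))
    return nodes

def parents_of(cmap, node):
    return [p for (p, c) in cmap.count if c == node]

def gibbs_estimate(cmap, target, evidence, iters=5000, burn_in=500):
    """Estimate P(target = 1 | evidence) by Gibbs sampling."""
    free = [n for n in all_nodes(cmap) if n not in evidence]
    state = dict(evidence)                 # clamped words -> +1/-1
    for n in free:
        state[n] = random.choice([1, -1])

    def local_prob(node, value):
        # P(node = value | current parent outcomes); parentless nodes
        # get a flat 0.5 prior (an assumption of this sketch).
        ps = parents_of(cmap, node)
        p1 = cmap.p_true(node, {p: state[p] for p in ps}) if ps else 0.5
        return p1 if value == 1 else 1.0 - p1

    hits = total = 0
    for sweep in range(iters):
        for n in free:
            score = {}
            for v in (1, -1):              # unnormalized full conditional
                state[n] = v
                s = local_prob(n, v)
                for c in cmap.children[n]:
                    s *= local_prob(c, state[c])
                score[v] = s
            state[n] = 1 if random.random() < score[1] / (score[1] + score[-1]) else -1
        if sweep >= burn_in:
            total += 1
            hits += state[target] == 1
    return hits / total

# e.g. an association queried through the network:
# gibbs_estimate(cmap, "profit", {"investment": 1})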

[Plot: mutual information (y) versus frequency of bigrams (x, 50-200); series: average MI, variance.]

Figure 2: Average MIs and variances. The 378,888 unique bigrams are classified according to frequency.

The computation in figure 1 was done using the graph reduction method. As the example inferences show, the association between any combination of variables can be measured.

3 Experiments

The goal of our experiments is first to find out how data sparseness is related to frequency-based statistics, and then to show that the Collocation map based method gives more reliable approximations. In particular, we observed from the experiments that the variances of the statistics may suggest the level of data sparseness: the less frequent data tended to have higher variances, though the values of the statistics themselves (mutual information, for instance) did not distinguish the level of occurrences. The predictive account of the Collocation map is demonstrated by observing the variances of its approximations on the infrequent events.

The tagged Wall Street Journal articles of the Penn Treebank corpus were used, which contain about 2.6 million word units; about 1.2 million of them were used in the experiments. Programs were coded in C and run on a Sun Sparc 10 workstation. From the first 1.2 million words, the bigrams consisting of four types of categories (NN, NNS, IN, JJ) were obtained, and the mutual information of each bigram (order-insensitive) was computed. The bigrams were classified into 200 sets according to their occurrence counts. Figure 2 summarizes the average MI value and the variance of each frequency range. Figure 3, which shows the occurrence distribution of the 378,888 unique bigrams, indicates that about 70% of them occur only once. One interesting and important observation is that the bigrams in the 1 to 3 frequency range, which make up about 90% of the population, have very high MI values. This result also agrees with Dunning's argument about overestimation of infrequent cooccurrences, in which many infrequent pairs tend to receive high estimates (Dunning 1993). The problem, according to Dunning (1993), is due to the assumption of normality in naive frequency-based statistics. The approximated values thus do not indicate the level of data quality, while figure 2 shows that the variances can suggest the level of data sufficiency. From this observation we propose the following definition of data sparseness.

A set of units belonging to a sample of ordered word units (texts) is α data-sparse if and only if the variance of the measurements on the set is greater than α.

The definition sets the concept of sparseness within the context of a focused set of linguistic units. For a set of units unobserved in a sample, the given sample text is certainly data-sparse; the definition then gives a way to judge sparseness with respect to observed units as well. How to measure data sparseness is a good issue for further study and may depend on the research context; here we suggest one simple method, perhaps for the first time in the literature.

Figure 4 compares the results from using the Collocation map and the simple frequency statistic. The variances are smaller, and the pairs in the frequency-1 class receive nonzero approximations. Because computation on the Collocation map is very expensive, we chose 2000 unique pairs at random. The network consists of 988 nodes, and computing one approximation (one inference) took about 3 minutes. A test set of 2000 pairs may not be sufficient, but it showed a consistent tendency of graceful degradation of the variances. The overestimation problem was not significant in the approximations by the Collocation map. The average value of the zero-frequency class, to which 50 unobserved pairs belong, was also on the line of smooth degradation; figure 4 shows only the variances. Table 1 summarizes the details of the performance gain from using the Collocation map.
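The paper does not state its MI formula; for a bigram it is presumably the standard pointwise mutual information, I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ], estimated by relative frequencies. Under that assumption, the bookkeeping behind figures 2 and 3, grouping unique bigrams by occurrence count and taking the mean and variance of MI within each class, can be sketched as follows (the corpus file and the omission of the POS filtering are stand-ins for the actual setup).

import math
from collections import Counter, defaultdict
from statistics import mean, pvariance

words = open("corpus.txt").read().split()            # stand-in corpus
unigrams = Counter(words)
bigrams = Counter(frozenset(p) for p in zip(words, words[1:])
                  if p[0] != p[1])                   # order-insensitive pairs
n_uni, n_bi = len(words), sum(bigrams.values())

def pmi(pair):
    # PMI = log2( P(x, y) / (P(x) P(y)) ) with relative-frequency estimates.
    x, y = tuple(pair)
    p_xy = bigrams[pair] / n_bi
    return math.log2(p_xy / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))

# Group the unique bigrams by frequency class and summarize MI per class,
# as in figure 2; the class sizes give the distribution of figure 3.
by_freq = defaultdict(list)
for pair, freq in bigrams.items():
    by_freq[freq].append(pmi(pair))

for freq in sorted(by_freq):
    scores = by_freq[freq]
    print(freq, len(scores), mean(scores), pvariance(scores))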
4 Conclusion

Corpus-based natural language processing has been one of the central subjects gaining rapid attention from the research community. The major virtue of statistical approaches is in evaluating linguistic events and determining their relative importance in resolving ambiguities. The evaluation of the events (mostly cooccurrences), however, has in many cases been unreliable because of the lack of data. Data sparseness refers to this shortage of data for estimating probabilistic parameters: too many events go unobserved, and even when events are found, their occurrences are often not sufficient for the estimation to be reliable. In contrast with existing methods that rest on strong assumptions, the method using the Collocation map promises a logical approximation, since it is built on a thorough formal argument from Bayesian probability theory. The powerful feature of the framework is its ability to exploit the conditional independence among word units and to make associations about unseen cooccurrences based on observed ones. This naturally provides the attributes required for dealing with data sparseness. Our experiments confirm that the Collocation map makes predictive approximations and avoids overestimating infrequent cooccurrences. One critical drawback of the Collocation map is its time complexity, but it can be useful for applications of limited scope.

[Plot: percentage (y, 0-0.8) versus frequency of bigrams (x, 0-10).]

Figure 3: The distribution of the 378,888 unique bigrams. The first ten frequency classes are shown.

Frequency | Collocation map variance | Frequency-based variance | Reduction
1         | 5.1                      | 12.2                     | 57%
10        | 2.28                     | 4.28                     | 46%
20        | 1.29                     | 5.29                     | 75%
30        | 1.51                     | 3.51                     | 56%
40        | 2.18                     | 3.18                     | 31%
50        | 1.52                     | 2.87                     | 47%
average   | 2.04                     | 4.5                      | 45%

Table 1: Comparison of variances between frequency-based and Collocation map based MI computations.

12 fre, luency based Cl[catic n ma[ 10 MI variance q 4 O. w 0 0 r 0 : uo ~ 0 ~ 0 0 0 0 0 0 0 0 0 0 0 0 0 Ug 0 5 I0 15 20 25 30 35 40 45 50 Frequency f bigrams Figure 4: Variances by frequency based and Cllcatin map based MI cmputatins fr 2000 unique bigrarns. 58

References

Kenneth W. Church and William A. Gale. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5, 19-54.

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61-74.

Ido Dagan, Shaul Marcus, and Shaul Markovitch. 1992. Contextual word similarity and estimation from sparse data. In Proceedings of the AAAI Fall Symposium, Cambridge, MA, 164-171.

Young S. Han, Young G. Han, and Key-Sun Choi. 1992. Recursive Markov chain as a stochastic grammar. In Proceedings of a SIGLEX Workshop, Columbus, Ohio, 22-31.

Young S. Han, Young C. Park, and Key-Sun Choi. 1995. Efficient inferencing for sigmoid Bayesian networks. To appear in Applied Intelligence.

Radford M. Neal. 1992. Connectionist learning of belief networks. Artificial Intelligence, 56, 71-113.

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers.

Fernando Pereira, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In Proceedings of the Annual Meeting of the ACL.