Error Bounds for Context Reduction and Feature Omission


Eugen Beck 1, Ralf Schlüter 1, Hermann Ney 1,2

1 Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Ahornstr. 55, 52056 Aachen, Germany
2 Spoken Language Processing Group, LIMSI CNRS, Paris, France
{beck, schlueter, ney}@cs.rwth-aachen.de

Abstract

In language processing applications like speech recognition, printed or handwritten character recognition, or statistical machine translation, the language model usually has a major influence on the performance by introducing context. An increase in context length usually improves perplexity and increases the accuracy of a classifier using such a language model. In this work, the effect of context reduction, i.e. the accuracy difference between a context-sensitive and a context-insensitive classifier, is considered. Context reduction is shown to be related to feature omission in the case of single-symbol classification. Therefore, the simplest non-trivial case of feature omission is analyzed by comparing a feature-aware classifier that uses an emission model to a prior-only classifier that statically infers the prior-maximizing class only and thus ignores the observation underlying the classification problem. Tight upper and lower bounds are presented for the accuracy difference of these model classifiers. The corresponding analytic proofs, though not presented here, were supported by an extensive simulation analysis of the problem, which gave empirical estimates of the accuracy difference bounds. Further, it is shown that the same bounds, though no longer tightly, also apply to the original case of context reduction. This result is supported by further simulation experiments for symbol string classification.

Index Terms: language model, context, error bounds

1. Introduction

In applications like automatic speech recognition, statistical machine translation, and printed or handwritten character recognition, classification refers to string classes, where each class represents a string (or sequence) of symbols (words, characters, phonemes, etc.). The corresponding language models, which provide symbol probability distributions in symbol sequence contexts of varying length, are an important aspect of many natural language processing tasks. Language modeling paradigms may be based on smoothed n-gram counts [7] or on multilayer perceptrons [1, 12]. Empirically, using longer context improves perplexity and, up to some extent, also the accuracy [12] of string classifiers. Nevertheless, to the best of the authors' knowledge, currently no formal relation is known between the order of the Markov model used in the language model and the accuracy of the resulting recognition system.

To discover corresponding bounds, an empirical Monte Carlo approach was applied. To judge whether a measure is a potential candidate for a bound, millions of distributions were simulated, discarding measures that did not exhibit a suitable bounding behavior on the accuracy difference of two classifiers with different context lengths. If a bound existed, its functional form was conjectured, followed by an attempt to find a formal proof.
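As a rough illustration of this screening loop (a minimal sketch, not the authors' code), the snippet below samples random class priors and emission models from Dirichlet distributions, computes the accuracy difference between the Bayes classifier and a prior-only classifier, and counts how often an example candidate statistic, here an average L1 distance between class posterior and class prior chosen purely for illustration, fails to dominate that difference; a single failure would disqualify the candidate.

```python
import numpy as np

rng = np.random.default_rng(0)

def accuracy_difference(prior, emission):
    """Delta A = sum_x max_c pr(c) pr(x|c) - max_c pr(c)."""
    joint = prior[:, None] * emission          # joint[c, x] = pr(c) pr(x|c)
    return joint.max(axis=0).sum() - prior.max()

def candidate_measure(prior, emission):
    """Illustrative candidate statistic only: average L1 distance between posterior and prior."""
    joint = prior[:, None] * emission
    px = joint.sum(axis=0)                     # pr(x)
    posterior = joint / px                     # pr(c|x)
    return float(np.sum(px * np.abs(posterior - prior[:, None]).sum(axis=0)))

def screen(num_classes=8, num_observations=16, trials=10_000):
    """Count how often the candidate fails to dominate the accuracy difference."""
    violations = 0
    for _ in range(trials):
        prior = rng.dirichlet(np.ones(num_classes))
        emission = rng.dirichlet(np.ones(num_observations), size=num_classes)
        if candidate_measure(prior, emission) < accuracy_difference(prior, emission):
            violations += 1
    return violations  # any violation means the candidate is discarded

print(screen())
```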
Information theory provides a number of bounds on the Bayes error itself. Examples are the Chernoff bound [3], the Lainiotis bound [9], and the nearest neighbor bound [6]. These bounds do not provide information on the effect of context in string classification, although the nearest-neighbor bound resembles a part of the lower bound presented here. In [4], an upper bound on the Bayes error of a string classifier using two classes is described. The bound is a function of the class prior and requires a restriction on the class-conditional observation distribution. In [10], two bounds on the accuracy difference between a Bayes single-symbol classifier and a model classifier (e.g. one learned from data) are presented. These bounds are based on the squared distance and the Kullback-Leibler divergence [8]. The Kullback-Leibler based bound was later tightened and extended to the general class of f-divergences [5] in [11].

In this work, the feature dependence of a classifier is analyzed by comparing a feature-aware classifier using an emission model to a prior-only classifier that statically infers the prior-maximizing class only. The corresponding accuracy difference between such a pair of classifiers is shown to be closely related to the accuracy difference between a context-sensitive and a context-insensitive classifier, which was the original motivation for this work. Tight upper and lower bounds are presented for this accuracy difference. Although not presented here, analytic proofs are available. Extensive simulation analysis of the problem provided the initial hypotheses that led to these proofs. Further derivations presented here show that the derived bounds can also be related to the accuracy difference induced by context length variation in a language model for symbol string classification, which is supported by simulation results.

2. Context Reduction vs. Feature Omission

Let C be a finite set of classes (e.g. words, symbols, etc.) and let X be the set of observations; for simplicity, X is assumed to be finite. The task of string classification is then to map a sequence of observations $x_1^N \in X^N$ to a sequence of classes $c_1^N \in C^N$. Note that the sequences of classes and observations have the same length here, so no alignment problem arises, unlike in automatic speech recognition. An exemplary task represented by this model is part-of-speech tagging. Let

$$pr(c_1^N, x_1^N) = pr(c_1^N)\, pr(x_1^N \mid c_1^N)$$

be the probability mass function of the true joint distribution, with the language model $pr(c_1^N)$ and the observation model $pr(x_1^N \mid c_1^N)$.
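The following toy construction shows one way such a factorized joint distribution can be instantiated; the sizes and random parameters are arbitrary, and the bigram language model and position-local observation model it uses anticipate the assumptions introduced below. It is an illustration only, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
C, X, N = 3, 4, 5                               # classes, observations, string length

lm_init = rng.dirichlet(np.ones(C))             # plays the role of pr(c_1 | c_0)
lm_bigram = rng.dirichlet(np.ones(C), size=C)   # pr(c_n | c_{n-1}), one row per predecessor
emission = rng.dirichlet(np.ones(X), size=C)    # pr(x_n | c_n)

def joint(cs, xs):
    """pr(c_1^N, x_1^N) = pr(c_1^N) * prod_n pr(x_n | c_n) for one pair of strings."""
    p = lm_init[cs[0]] * np.prod([lm_bigram[a, b] for a, b in zip(cs, cs[1:])])
    return p * np.prod([emission[c, x] for c, x in zip(cs, xs)])

print(joint([0, 2, 1, 1, 0], [3, 0, 2, 2, 1]))
```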

Then the accuracy of a Bayes classifier at position $i$ in the string of classes is

$$A_i = \sum_{x_1^N} \max_c \sum_{c_1^N:\, c_i = c} pr(c_1^N)\, pr(x_1^N \mid c_1^N).$$

The language model is assumed to be a bigram:

$$pr(c_1^N) = \prod_{n=1}^{N} pr(c_n \mid c_{n-1}).$$

From this bigram, a position-dependent unigram can be derived by marginalization for position $i$, $1 \le i \le N$:

$$pr_i(c) = \sum_{c_1^N:\, c_i = c} pr(c_1^N).$$

Also, it is assumed that the observation model $pr(x_1^N \mid c_1^N)$ only exhibits local dependence:

$$pr(x_1^N \mid c_1^N) = \prod_{n=1}^{N} pr(x_n \mid c_n).$$

To measure the effect of the language model context, the difference $\Delta A_i$ between the accuracy $A_i$ of the full, bigram-based classifier and the accuracy $\tilde{A}_i$ of the reduced-context classifier based on the derived unigram prior is considered:

$$\Delta A_i = A_i - \tilde{A}_i = \sum_{x_1^N} \Big[ \max_c\, pr_i(c, x_1^N) - \max_c\, pr_i(c, x_i) \Big],$$

with

$$pr_i(c, x_1^N) := \sum_{c_1^N:\, c_i = c} pr(c_1^N, x_1^N), \qquad pr_i(c, x) := pr_i(c)\, pr(x \mid c).$$

To emphasize the connection to single symbols, the last equation is rewritten as follows:

$$\Delta A_i = \sum_{x_i} pr_i(x_i)\, \Delta A_i(x_i), \qquad (1)$$

with the definition of the local accuracy difference

$$\Delta A_i(x_i) := \sum_{y = x_1^N \setminus x_i} pr_i(y \mid x_i)\, \max_c\, pr_i(c \mid y, x_i) - \max_c\, pr_i(c \mid x_i), \qquad (2)$$

and the marginals in symbol position $i$, with $y = x_1^N \setminus x_i$:

$$pr_i(x) = \sum_{c_1^N,\, x_1^N:\, x_i = x} pr(c_1^N)\, pr(x_1^N \mid c_1^N),$$
$$pr_i(c \mid x) = \frac{pr_i(c)\, pr(x \mid c)}{pr_i(x)},$$
$$pr_i(c \mid y, x_i) = pr_i(c \mid x_1^N) = \frac{pr_i(c, x_1^N)}{\sum_{c'} pr_i(c', x_1^N)},$$
$$pr_i(c, y \mid x_i) = pr_i(c, x_1^N \setminus x_i \mid x_i) = \frac{pr_i(c, x_1^N)}{pr_i(x_i)},$$
$$pr_i(y \mid x_i) = \sum_{c} pr_i(c, y \mid x_i).$$

The local accuracy difference defined in Eq. (2) is exactly the difference between the accuracies of a single-symbol classifier that maps an observation $y \in Y$ to a single class $c \in C$, and a classifier that only uses the prior (mapping every observation to the same class). Discarding the condition on $x_i$ and replacing $y$ with $x$, the accuracy difference for the case of feature omission is obtained:

$$\Delta A = A - \tilde{A} = \sum_{x} \max_c\, pr(c)\, pr(x \mid c) - \max_c\, pr(c), \qquad (3)$$

for which bounds will be derived in the following section; these also lead to similar bounds for the symbol string classification case introduced here.

3. Gini Difference Bounds

Assume single-symbol classification, and define the following statistical measure for the difference between the class posterior and the class prior probability, which will be called the Gini difference in the following:

$$\Delta G = \sum_{x} pr(x) \sum_{c} pr(c \mid x)^2 - \sum_{c} pr(c)^2 = \sum_{x} pr(x) \sum_{c} \big[ pr(c \mid x) - pr(c) \big]^2.$$

The term Gini difference is chosen here as it is similar to the Gini criterion used, e.g., in decision tree learning. In [6], the minuend and subtrahend of the Gini difference are known as the Bayesian distance. In the following, tight lower and upper bounds on the accuracy difference for the case of feature omission are presented in terms of the Gini difference. The corresponding proofs are not presented for lack of space, but are available from the authors on request. Note that both the Gini difference and the accuracy difference take values between $0$ and $1 - 1/|C|$. Therefore, both measures are normalized:

$$\Delta A^* = \frac{|C|}{|C|-1}\, \Delta A, \qquad \Delta G^* = \frac{|C|}{|C|-1}\, \Delta G.$$

As shown in the following, in terms of these normalized measures the bounds do not explicitly depend on the number of classes.
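As a numerical illustration (with arbitrary sizes and Dirichlet-sampled distributions, not taken from the paper), the snippet below evaluates the accuracy difference of Eq. (3), checks that the two algebraic forms of the Gini difference coincide, and forms the normalized quantities used by the bounds that follow.

```python
import numpy as np

rng = np.random.default_rng(2)
C, X = 8, 16                                    # arbitrary example sizes

prior = rng.dirichlet(np.ones(C))               # pr(c)
emission = rng.dirichlet(np.ones(X), size=C)    # pr(x | c)
joint = prior[:, None] * emission               # pr(c, x)
px = joint.sum(axis=0)                          # pr(x)
posterior = joint / px                          # pr(c | x)

delta_A = joint.max(axis=0).sum() - prior.max()                              # Eq. (3)
delta_G = np.sum(px * (posterior ** 2).sum(axis=0)) - np.sum(prior ** 2)     # first form
delta_G_alt = np.sum(px * ((posterior - prior[:, None]) ** 2).sum(axis=0))   # second form
assert np.isclose(delta_G, delta_G_alt)         # the two algebraic forms coincide

norm = C / (C - 1.0)                            # both quantities lie in [0, 1 - 1/C]
print("dA* =", norm * delta_A, " dG* =", norm * delta_G)
```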

3.1. Upper Bound

The normalized accuracy difference of Eq. (3) is tightly bounded from above by the square root of the normalized Gini difference:

$$\Delta A^* \le \sqrt{\Delta G^*}.$$

3.2. Lower Bound

The lower bound, expressed as a function of the Gini difference, consists of three different segments.

3.2.1. First Segment of the Lower Bound

The (normalized) accuracy difference is non-negative:

$$\Delta A^* \ge 0, \qquad (4)$$

and equality can be obtained iff the normalized Gini difference is constrained to $\Delta G^* \le \tfrac{1}{4}$.

3.2.2. Second Segment of the Lower Bound

Also, the normalized accuracy difference is linearly bounded from below by the normalized Gini difference minus a constant:

$$\Delta A^* \ge \Delta G^* - \tfrac{1}{4}.$$

This bound is tight for $\tfrac{1}{4} \le \Delta G^* \le \tfrac{3}{4}$.

3.2.3. Third Segment of the Lower Bound

If the Gini difference is constrained to

$$\Delta G^* \ge \tfrac{3}{4}, \qquad (5)$$

then the set of tight lower bounds of the normalized accuracy difference is completed by

$$\Delta A^* \ge 1 - \sqrt{1 - \Delta G^*}, \qquad \text{i.e.} \qquad \Delta G^* \le 2\,\Delta A^* - (\Delta A^*)^2.$$

The bounds are shown in Fig. 1 in terms of the normalized Gini difference and the normalized accuracy difference.

3.3. Transition to Context Reduction

For the case of symbol string classification, the Gini difference can also be defined for a specific symbol position $i$:

$$\Delta G_i := \sum_{x_i} pr_i(x_i)\, \Delta G_i(x_i),$$

with the local Gini difference

$$\Delta G_i(x_i) := \sum_{y = x_1^N \setminus x_i} pr_i(y \mid x_i) \sum_c pr_i(c \mid y, x_i)^2 - \sum_c pr_i(c \mid x_i)^2.$$

Apart from the additional condition on $x_i$, both the local accuracy difference $\Delta A_i(x_i)$ and the local Gini difference $\Delta G_i(x_i)$ can effectively be identified as single-symbol cases, such that the same upper and lower bounds apply as derived for the feature omission case in Subsecs. 3.1 and 3.2. Also, note that these upper and lower bounds are concave and convex functions, respectively. Now assume these upper and lower bounds are represented by the two functions $g$ and $f$, respectively (now without normalization of the Gini and accuracy difference, without loss of generality), such that

$$\Delta A_i(x_i) \le g\big(\Delta G_i(x_i)\big), \qquad (6)$$
$$\Delta A_i(x_i) \ge f\big(\Delta G_i(x_i)\big). \qquad (7)$$

Then Jensen's inequality [2, p. 82] can be applied to obtain the same bounds for the global, symbol string case:

$$\Delta A_i = \sum_{x_i} pr_i(x_i)\, \Delta A_i(x_i) \le \sum_{x_i} pr_i(x_i)\, g\big(\Delta G_i(x_i)\big) \qquad \text{(Eq. (6))}$$
$$\le g\Big(\sum_{x_i} pr_i(x_i)\, \Delta G_i(x_i)\Big) = g(\Delta G_i), \qquad \text{(Jensen's inequality, concave case)}$$

$$\Delta A_i = \sum_{x_i} pr_i(x_i)\, \Delta A_i(x_i) \ge \sum_{x_i} pr_i(x_i)\, f\big(\Delta G_i(x_i)\big) \qquad \text{(Eq. (7))}$$
$$\ge f\Big(\sum_{x_i} pr_i(x_i)\, \Delta G_i(x_i)\Big) = f(\Delta G_i). \qquad \text{(Jensen's inequality, convex case)}$$

Nevertheless, it should be mentioned that these global bounds for the symbol string case are not necessarily tight anymore, as is confirmed by the simulations shown in the following section.

4. Simulations

4.1. Feature Omission: Single Symbol Case

In order to determine the exact relation between the Gini difference and the accuracy difference, originally millions of distributions were simulated to calculate their values of the Gini and the accuracy difference for a number of configurations. In Fig. 1, the results of such a simulation for 8 classes and a set of 6 different discrete observations are presented. An upper and a lower bound for the accuracy difference as a function of the Gini difference is visible. This type of simulation was also performed for other combinations of C and X, and from these results the upper and lower bounds presented in Sec. 3 were hypothesized empirically by extensive analysis of the simulations, which in turn led to the corresponding formal proofs.

Figure 1: Simulation results for 8 classes and 6 observations, plotted as Gini difference (horizontal axis) versus accuracy difference (vertical axis). Each gray dot represents one simulated distribution. The derived analytic tight upper and lower bounds are shown in red and blue, respectively.
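A small-scale version of this simulation can be sketched as follows; it is an independent re-implementation under assumed Dirichlet sampling with arbitrary sizes, not the authors' code. For every simulated distribution it records how far the point (normalized Gini difference, normalized accuracy difference) strays beyond the analytic bounds; both reported gaps should remain at zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(3)
C, X, trials = 8, 16, 20_000                    # arbitrary example sizes

def lower_bound(g):
    """Piecewise tight lower bound of Sec. 3.2 (normalized quantities)."""
    if g <= 0.25:
        return 0.0
    if g <= 0.75:
        return g - 0.25
    return 1.0 - np.sqrt(max(1.0 - g, 0.0))

worst_gap_upper = worst_gap_lower = 0.0
for _ in range(trials):
    prior = rng.dirichlet(np.ones(C))
    emission = rng.dirichlet(np.ones(X), size=C)
    joint = prior[:, None] * emission
    px = joint.sum(axis=0)
    posterior = joint / px
    dA = joint.max(axis=0).sum() - prior.max()
    dG = np.sum(px * (posterior ** 2).sum(axis=0)) - np.sum(prior ** 2)
    a, g = C / (C - 1.0) * dA, max(C / (C - 1.0) * dG, 0.0)
    worst_gap_upper = max(worst_gap_upper, a - np.sqrt(g))        # should never exceed 0
    worst_gap_lower = max(worst_gap_lower, lower_bound(g) - a)    # should never exceed 0

print(worst_gap_upper, worst_gap_lower)
```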

4.2. Context Reduction: Symbol String Case

The same experiments were performed for symbol string classification. The upper and lower bounds from the single-symbol case (feature omission) do hold for the string case, as shown in Sec. 3.3, but the simulations suggest that in this case the bounds are not tight anymore, i.e. the simulations do not reach the bounds in general, as shown in Fig. 2.

Figure 2: Simulation results for a string classifier with 5 classes, 10 observations, and sequence length N = 3. The accuracy/Gini difference was calculated at position i = 2. Each gray dot represents one simulated distribution.

In Fig. 3, the number of classes and the number of observations were proportionally reduced, upon which the space between the analytical bounds is much less filled. This might be due to the dependency between the distributions at the individual positions, which might be stronger for a lower number of classes and observations.

Figure 3: Simulation results for a string classifier with 3 classes, 6 observations, and sequence length N = 3. The accuracy/Gini difference was calculated at position i = 2. Each gray dot represents one simulated distribution.

When (slightly) increasing the length N, apparently no strong difference can be observed, as shown in Fig. 4. The number of observations was reduced somewhat here, as the complexity of the simulations apparently is exponential, and the number of simulations required to obtain a good filling of the space between the bounds increases strongly.

Figure 4: Simulation results for a string classifier with 8 classes, 9 observations, and sequence length N = 5. The accuracy/Gini difference was calculated at position i = 3. Each gray dot represents one simulated distribution.
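For very small configurations, the string-case quantities can be computed exactly by enumeration. The sketch below (an assumed re-implementation with arbitrary sizes, not the authors' code) builds one random bigram language model and emission model, evaluates the position-wise accuracy difference and Gini difference, and compares them with the single-symbol bounds, which hold here but are typically not attained.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
C, X, N = 3, 3, 3                                # tiny sizes keep enumeration exact
i = 1                                            # zero-based; corresponds to position i = 2 in the text

lm_init = rng.dirichlet(np.ones(C))              # plays the role of pr(c_1 | c_0)
lm_bigram = rng.dirichlet(np.ones(C), size=C)    # pr(c_n | c_{n-1})
emission = rng.dirichlet(np.ones(X), size=C)     # pr(x | c)

# pr_i[c, x_1, ..., x_N] = sum over class strings with c_i = c of pr(c_1^N, x_1^N)
pr_i = np.zeros((C,) + (X,) * N)
for cs in itertools.product(range(C), repeat=N):
    p_lm = lm_init[cs[0]] * np.prod([lm_bigram[a, b] for a, b in zip(cs, cs[1:])])
    for xs in itertools.product(range(X), repeat=N):
        pr_i[(cs[i],) + xs] += p_lm * np.prod([emission[c, x] for c, x in zip(cs, xs)])

flat = pr_i.reshape(C, -1)                       # pr_i(c, x_1^N), one column per observation string
A_full = flat.max(axis=0).sum()                  # accuracy of the bigram (full-context) classifier

other_axes = tuple(k for k in range(1, N + 1) if k != i + 1)
pr_c_xi = pr_i.sum(axis=other_axes)              # pr_i(c, x_i) = pr_i(c) pr(x_i | c)
A_reduced = pr_c_xi.max(axis=0).sum()            # accuracy of the unigram (reduced-context) classifier
dA_i = A_full - A_reduced

px_full = flat.sum(axis=0)                       # pr(x_1^N)
px_i = pr_c_xi.sum(axis=0)                       # pr_i(x_i)
G_full = np.sum(px_full * ((flat / px_full) ** 2).sum(axis=0))
G_xi = np.sum(px_i * ((pr_c_xi / px_i) ** 2).sum(axis=0))
dG_i = G_full - G_xi

norm = C / (C - 1.0)                             # same normalization as in the single-symbol case
a, g = norm * dA_i, norm * dG_i
lower = 0.0 if g <= 0.25 else (g - 0.25 if g <= 0.75 else 1.0 - np.sqrt(max(1.0 - g, 0.0)))
print(f"dA*_i = {a:.4f}, dG*_i = {g:.4f}, bounds: [{lower:.4f}, {np.sqrt(g):.4f}]")
```

The brute-force enumeration over all class strings and observation strings also makes explicit the exponential growth in complexity mentioned in the discussion of Fig. 4 above.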

5. Conclusions & Outlook

In this work, upper and lower bounds on the accuracy difference for feature omission in single-symbol classification, and for context reduction in symbol string recognition, were investigated. First of all, a relation between both cases was derived. Further, tight upper and lower bounds were presented for the single-symbol case. Monte Carlo simulations played an important role in the discovery, as well as in the formal proof, of the bounds presented. Further simulations for the case of context reduction in symbol string classification were presented, which underline the relation between both cases. As suggested by these, the presented bounds, although tight for the single-symbol case, do not seem to be tight in general for the symbol string case. Nevertheless, the simulations strongly hint at the existence of tighter bounds for the symbol string case, which will be investigated in future work. To the knowledge of the authors, the bounds presented are the first to analytically support the empirically observed effect of feature omission and context reduction on the accuracy.

6. Acknowledgments

The authors would like to thank Tamer Alkhouli and Malte Nuhn for many insightful conversations on this topic. This work has been supported by a compute time grant on the RWTH ITC cluster. This work was partly funded under the project EU-Bridge (FP7-287658). H. Ney was partially supported by a senior chair award from DIGITEO, a French research cluster in Ile-de-France.

7. References

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, A Neural Probabilistic Language Model, Journal of Machine Learning Research, Vol. 3, pp. 1137-1155, 2003.
[2] G. Casella and R. L. Berger, Statistical Inference, Duxbury Press, Belmont, California, 1990, 650 pages.
[3] H. Chernoff, A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations, The Annals of Mathematical Statistics, Vol. 23, No. 4, pp. 493-507, 1952.
[4] J. Chu, Error Bounds for a Contextual Recognition Procedure, IEEE Transactions on Computers, Vol. C-20, No. 10, pp. 1203-1207, Oct. 1971.
[5] I. Csiszár, Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten, Magyar Tud. Akad. Mat. Kutató Int. Közl., Vol. 8, pp. 85-108, 1963.
[6] P. A. Devijver, On a New Class of Bounds on Bayes Risk in Multihypothesis Pattern Recognition, IEEE Transactions on Computers, Vol. C-23, No. 1, pp. 70-80, Jan. 1974.
[7] R. Kneser and H. Ney, Improved Backing-Off for m-gram Language Modeling, in Proc. IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 181-184, Detroit, MI, May 1995.
[8] S. Kullback and R. Leibler, On Information and Sufficiency, The Annals of Mathematical Statistics, Vol. 22, No. 1, pp. 79-86, 1951.
[9] D. Lainiotis, A Class of Upper Bounds on Probability of Error for Multihypotheses Pattern Recognition (Corresp.), IEEE Transactions on Information Theory, Vol. 15, No. 6, pp. 730-731, Nov. 1969.
[10] H. Ney, On the Relationship Between Classification Error Bounds and Training Criteria in Statistical Pattern Recognition, in Proc. Iberian Conference on Pattern Recognition and Image Analysis, pp. 636-645, Puerto de Andratx, Spain, Jun. 2003.
[11] R. Schlüter, M. Nußbaum-Thom, E. Beck, T. Alkhouli, and H. Ney, Novel Tight Classification Error Bounds under Mismatch Conditions Based on f-Divergence, in Proc. IEEE Information Theory Workshop, pp. 432-436, Sevilla, Spain, Sep. 2013.
[12] H. Schwenk, Continuous Space Language Models, Computer Speech & Language, Vol. 21, No. 3, pp. 492-518, 2007.