MAXIMUM A POSTERIORI TRANSDUCTION

LI-WEI WANG, JU-FU FENG
School of Mathematical Sciences, Peking University, Beijing, 100871, China
Center for Information Sciences, Peking University, Beijing, 100871, China
E-MAIL: {wanglw, fjf}@cis.pku.edu.cn

Abstract:
Transduction deals with the problem of estimating the values of a function at given points (called working samples) from a set of training samples. This paper proposes a maximum a posteriori (MAP) scheme for transduction. The probability measure defined for the estimation is induced by the code length of the prediction error and the model with respect to some coding system. The ideal MAP transduction essentially minimizes the so-called stochastic complexity. Approximations to the ideal MAP transduction are also addressed, in which one or multiple models of the function are estimated along with the values at the working samples. This work investigates, for both pattern classification and regression, under what conditions the approximated MAP transduction is better than traditional induction, which learns models from the training samples and then computes the values at the given points. An analysis of whether the working samples compress the description length of the model is also presented: for some coding systems they do, for others they do not. For fairness, a universal coding system should be adopted, but the resulting quantities are then not recursively computable.

Keywords:
Transduction; maximum a posteriori; minimum description length; stochastic complexity.

1. Introduction

Transduction [1] deals with the problem of estimating the values of a functional dependency at given points x_{n+1}, x_{n+2}, ..., x_{n+k} (called the working sample) from a set of training samples (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). In the traditional induction framework, one learns models from the training sample and then computes the values at the working sample, but in some cases this is not the best approach. The concept of transduction was introduced by Vapnik [1]. The background philosophy is that the solution of a relatively simple problem (estimating values at given points) should not depend on the solution of a substantially more complex problem (learning a model). Transduction has been applied to text classification with SVMs [2].

This paper proposes a maximum a posteriori (MAP) framework for transduction. The probability measure we define for the estimation is induced by the code length of the prediction error and the model with respect to some coding system. The ideal MAP transduction essentially selects y_{n+1}, y_{n+2}, ..., y_{n+k} so as to minimize the stochastic complexity [3].

In some applications, one wants to estimate not only the values at the given points but also an approximate model, or multiple models, of the underlying functional dependency. We use the word "model" instead of "function" to: a) distinguish it from the real functional dependency between x and y; and b) emphasize that the model has to be chosen from some model class. It should be clear, however, that when we say a model we mean a function: if M is a model, M(x) is the value of M at x. The MAP approach can be modified to solve such problems. We show that induction may be equivalent to MAP transduction if just one model is considered, but it can hardly achieve the maximal posterior probability when multiple models are estimated simultaneously. Both classification (indicator functions) and regression (real-valued functions) are addressed; it turns out that they are inherently different in MAP transduction.

It is widely believed that the improved performance over induction is due to the information contained in the working samples [4], [5]. We investigate whether the working sample can compress the description length of the model.
For pattern classification, the answer is positive with respect to some coding systems, while negative for others. The positive answer cannot be extended to regression problems directly, because (y_{n+1}, ..., y_{n+k}) takes continuous values. It should be mentioned that Bayesian transduction has been suggested in [6], but our approach differs from it not only in the probability measure used, but also in which probabilities are defined.

2. Code Length and the Probability Measure

When we deal with data in an application, a probability distribution is usually assumed for them. The probability measures defined in this paper, however, have no relation to this natural randomness: all the data are seen as numbers (or vectors), and so are the models.

Let M be a model. We first define the probability P(y | x, M). Assume L(y, M(x)) is a preassigned loss function that measures the distance between y and M(x). Some examples are:

L(y, M(x)) = (y - M(x))^2,  (1)

L(y, M(x)) = 0 if y = M(x), and 1 otherwise,  (2)

L(y, M(x)) = \max(0, 1 - y M(x)),  (3)

where (3) is equivalent to

L(y, M(x)) = \xi, subject to y M(x) \ge 1 - \xi, \xi \ge 0.  (4)

Define

P(y | x, M) = K_M e^{-L(y, M(x))},  (5)

where K_M is a normalization coefficient. We point out again that (5) does not attribute any randomness to x, y, or M. To explain what this probability measure is, note that by the Kraft inequality [3] there exists a prefix code encoding L(y, M(x)) that satisfies

P(y | x, M) = 2^{-c(L(y, M(x)))},  (6)

where c(L(y, M(x))) is the code length of L(y, M(x)). If y - M(x) takes continuous values, the P(y | x, M) so defined is a density function provided L(y, M(x)) is a distance measure; it can still be encoded after an appropriate quantization. So P(y | x, M) represents the description length of y given x and M, with respect to the loss function. This result extends naturally to a set of independent pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). In the rest of the paper we let X represent a set of x's, and similarly for Y.

Next, we define P(M | X) in this way. Suppose we have agreed on some coding system (a decoding function). Given X, the model M (which may need some quantization first) can be decoded from X together with a string s. We denote the decoding function by

M = d(X, s).  (7)

Define

P(M | X) = K \cdot 2^{-c(s^*)},  (8)

where s^* is the shortest string such that M = d(X, s^*) and K is a normalization coefficient. P(M | X) reduces to P(M) when no X is given. It should be pointed out that we have to impose some restriction on M, namely that M is chosen from some class of models C; otherwise, for any non-zero K,

\sum_{M} P(M | X) = \infty.  (9)

P(M | X) represents how many bits are needed to describe M given X with respect to the coding system. P(Y | X, M) and P(M) are often used, explicitly or implicitly, in many applications: maximization of P(Y | X, M) is minimization of the empirical risk, and P(M) may be determined by the number of free parameters in M. These probability measures, if properly employed, can give satisfactory estimations whatever the underlying distribution of (x, y) is; a well-known result due to Vapnik [1] is the consistency of Structural Risk Minimization (SRM).

It can easily be checked that the following two probability measures are well defined:

P(Y, M | X) = P(Y | X, M) P(M | X),  (10)

P(Y | X) = \sum_{M} P(Y, M | X).  (11)

P(Y, M | X) represents the code length needed to describe Y given X. This is a two-part code, the first part for the prediction error and the second for the model, but it is not the best coding scheme, since there is redundancy in coding the model. It is P(Y | X) that reveals the necessary bits to describe Y given X; the code length is, by information theory, -\log P(Y | X). This is essentially the stochastic complexity introduced by Rissanen [3]. Indeed, to define these probability measures the data cannot be considered random variables with a joint probability P(X, Y); otherwise P(Y | X, M) would not depend on M at all. This is where we differ from other work on Bayesian transduction [6].
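To make the correspondence between probabilities and code lengths concrete, the following small Python sketch (an illustration added here, not part of the original derivation) evaluates the two-part code length of (10) for a linear model under the squared loss of (1). The linear model, the conversion of the loss into bits, and the crude fixed bit budget per parameter standing in for -log2 P(M | X) are assumptions made for the example only.

    # A minimal numerical sketch of the code-length view of Section 2: the
    # likelihood term P(Y|X,M) is induced by a loss function as in Eq. (5),
    # and a two-part code length corresponds to -log2 P(Y|X,M) - log2 P(M|X).
    import numpy as np

    def neg_log2_likelihood(Y, X, w, b):
        """-log2 P(Y | X, M) for the squared loss of Eq. (1), up to the constant K_M."""
        residuals = Y - (X @ w + b)
        return np.sum(residuals ** 2) / np.log(2.0)   # e^{-L} measured in bits

    def model_bits(w, b, bits_per_param=16):
        """A stand-in for -log2 P(M | X): a fixed quantization budget per parameter."""
        return (w.size + 1) * bits_per_param

    def two_part_code_length(Y, X, w, b):
        """Code length corresponding to Eq. (10): prediction-error bits plus model bits."""
        return neg_log2_likelihood(Y, X, w, b) + model_bits(w, b)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(20, 3))
        w_true = np.array([1.0, -2.0, 0.5])
        Y = X @ w_true + 0.1 * rng.normal(size=20)
        print(two_part_code_length(Y, X, w_true, 0.0))

The ideal quantity of (11) would instead sum over every model in the class, which is exactly what makes the stochastic complexity hard to compute directly.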

3. MAP Estimation

Let (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) be the training sample, x_{n+1}, x_{n+2}, ..., x_{n+k} be the working sample, and y_{n+1}, y_{n+2}, ..., y_{n+k} be the corresponding values. To simplify notation, denote

X = (x_1, x_2, ..., x_n),  Y = (y_1, y_2, ..., y_n),
X^* = (x_{n+1}, x_{n+2}, ..., x_{n+k}),  Y^* = (y_{n+1}, y_{n+2}, ..., y_{n+k}).  (12)

The MAP estimation of Y^* is:

\hat{Y}^* = \arg\max_{Y^*} P(Y, Y^* | X, X^*)
          = \arg\max_{Y^*} \sum_{M \in C} P(Y, Y^* | X, X^*, M) P(M | X, X^*).  (13)

The ideal MAP transduction gives no model for the underlying functional dependency. What is more, it is intractable if the class C contains infinitely many models. We therefore look at an approximation of (13). In some applications one cares about the underlying functional dependency as well as Y^*; that is, one selects a model M from a class C and estimates Y^* simultaneously. The corresponding MAP estimator is:

(\hat{Y}^*, \hat{M}) = \arg\max_{Y^*, M \in C} P(Y, Y^*, M | X, X^*)
                     = \arg\max_{Y^*, M \in C} P(Y | X, M) P(Y^* | X^*, M) P(M | X, X^*).  (14)

Applying MAP to the induction, and denoting the estimators by \hat{M} and \hat{Y}^*, we have:

\hat{M} = \arg\max_{M \in C} P(Y | X, M) P(M | X),
\hat{Y}^* = \arg\max_{Y^*} P(Y^* | X^*, \hat{M}).  (15)

In fact, most applications do not employ P(M | X) but P(M) instead; we discuss the difference in the next section.

We are interested in whether the MAP transduction is always better than the induction. From (14) and (15), if

P(M | X, X^*) = P(M | X),  (16)

\max_{Y^*} P(Y^* | X^*, M) = f(X^*),  (17)

where f(X^*) is a function independent of M, then the two estimations are identical. Condition (16) means that X^* does not compress the code length of M given X; we investigate it in the next section. For most regression problems, each component of Y^* can take an arbitrary value and the loss can be minimized to zero, so \max_{Y^*} P(Y^* | X^*, M) does not depend on M (e.g. least-squares regression with the loss (1)). The situation is different for pattern classification. Here M is a classifier, such as M(x) = w^T x + b, which can take continuous values, but y \in \{-1, +1\}, so \max_{Y^*} P(Y^* | X^*, M) depends on M (as with the loss (3)). In this case the MAP transduction may outperform the induction. The advantage is due to the type mismatch between M(x) and y; it does not hold for classifiers with M(x) \in \{-1, +1\}.

The MAP estimation with one model as described above essentially finds a minimal-length two-part code. There is another approximation to the ideal MAP transduction: estimating multiple independent models M_1, ..., M_s from classes C_1, ..., C_s simultaneously, as well as Y^*:

(\hat{Y}^*, \hat{M}_1, ..., \hat{M}_s) = \arg\max_{Y^*, M_i \in C_i} P(Y, Y^*, M_1, ..., M_s | X, X^*)
                                       = \arg\max_{Y^*, M_i \in C_i} \prod_{i=1}^{s} P(Y, Y^* | X, X^*, M_i) P(M_i | X, X^*).  (18)

Applying MAP to the induction:

\hat{M}_i = \arg\max_{M_i \in C_i} P(Y | X, M_i) P(M_i | X),
\hat{Y}^* = \phi(\hat{M}_1(X^*), ..., \hat{M}_s(X^*)),  (19)

where \phi is a function that mixes \hat{M}_1(X^*), ..., \hat{M}_s(X^*) together (e.g. voting). Again we analyze whether the MAP transduction and the induction are equivalent. Leaving P(M | X, X^*) and P(M) aside, in (19) the models \hat{M}_1, ..., \hat{M}_s are estimated separately, so they are independent of each other, while in (18) they are closely related since Y^* is involved. This argument suggests that MAP transduction with multiple models may always be better than induction, for both classification and regression.
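The two estimators of (14) and (15) can be compared directly on a toy problem. The sketch below is an illustrative brute-force search under stated assumptions, not an algorithm proposed in the paper: it enumerates a small finite class of threshold classifiers and all label assignments for the working sample, scoring them with the hinge loss of (3); P(M) is taken uniform over the class, so its logarithm is a constant and can be dropped.

    # Brute-force comparison of single-model MAP transduction, Eq. (14),
    # against induction, Eq. (15), for binary classification with hinge loss.
    import itertools
    import numpy as np

    def hinge(y, fx):
        return np.maximum(0.0, 1.0 - y * fx)

    def log_p_labels(y, fx):
        """log P(y | x, M) up to the normalization constant, from Eq. (5)."""
        return -np.sum(hinge(y, fx))

    def map_transduction(X, Y, X_star, models):
        """arg max over (Y*, M) of log P(Y|X,M) + log P(Y*|X*,M); the uniform
        log P(M) is a constant and is omitted from the score (Eq. (14))."""
        best, best_score = None, -np.inf
        for M in models:
            for Y_star in itertools.product([-1, 1], repeat=len(X_star)):
                Y_star = np.array(Y_star)
                score = log_p_labels(Y, M(X)) + log_p_labels(Y_star, M(X_star))
                if score > best_score:
                    best, best_score = (Y_star, M), score
        return best

    def induction(X, Y, X_star, models):
        """First pick M by Eq. (15), then label the working sample with it."""
        M_hat = max(models, key=lambda M: log_p_labels(Y, M(X)))
        return np.sign(M_hat(X_star)), M_hat

    if __name__ == "__main__":
        X = np.array([-2.0, -1.0, 1.0, 2.0]); Y = np.array([-1, -1, 1, 1])
        X_star = np.array([-0.3, 0.4])
        models = [lambda x, t=t: x - t for t in (-0.5, 0.0, 0.5)]  # threshold classifiers
        print(map_transduction(X, Y, X_star, models)[0])
        print(induction(X, Y, X_star, models)[0])

Because the classifiers output real values while the labels are restricted to ±1, the term \max_{Y^*} P(Y^* | X^*, M) does depend on M here; this is the type mismatch that lets the transductive search differ from plain induction.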

4. Compression of the Model by the Working Sample

In this section we study whether X provides any help in compressing the model, that is, whether

P(M | X) = P(M)  (20)

and

P(M | X, X^*) = P(M | X)  (21)

hold. We analyze (20) only; there is no great difficulty in extending the following argument to (21). From (14) and (15) we see that if (20) and (21) hold, all the difference between transduction and induction is caused by the type mismatch or by the multiple models involved. For simplicity, we write X = (x_1, ..., x_n) in this section.

We have defined P(M | X) as proportional to 2^{-c(s^*)}: M can be decoded from X together with the shortest string s^*. For the pattern classification problem, consider the following coding system. For an arbitrary M, let the string s be as follows: the first n bits are assigned to y_1, y_2, ..., y_n, each being +1 or -1, and the rest of s is denoted by s_M. To decode M, one runs some training algorithm, such as an SVM, on (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Here the classifier has to be uniquely determined by the training algorithm, because the output of a decoder must be unique; classifiers like the perceptron are therefore not appropriate. At this stage we have the training result, a classifier M'; we then use s_M to adjust M' and obtain M. From the above description, P(M | X) depends heavily on X: for appropriate X, at most n extra bits are needed to decode M, which yields P(M | X) \ge K \cdot 2^{-n}, while without X it is impossible for every such M to have so large a probability.

Note that the argument above is with respect to a special coding system. For another coding system, e.g. a trivial decoder that does not consider X at all, (20) certainly holds (this coding system is often used implicitly in real applications). For fairness it is necessary to consider a universal coding system, i.e. a universal Turing machine. Unfortunately, the two probabilities are not recursively computable in this situation, so no definite answer can be given.

It should be noted that, even in the first coding system, the compression gained is due to the efficiency of encoding y_1, y_2, ..., y_n in the classification task. The argument is no longer valid for regression, since y takes continuous values: the bits needed to encode y_1, y_2, ..., y_n tend to infinity as more quantization levels are set on y. Whether there is any coding system that makes X contribute to M is not clear. We strongly suspect that for least-squares regression without any constraint on y, X does not contribute to M with respect to the universal coding system.
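The coding system sketched above can be made concrete in a few lines of Python. In the snippet below (an illustration under stated assumptions), the deterministic trainer is a nearest-mean classifier rather than the SVM mentioned in the text, chosen only because it is short and its output is uniquely determined by the data; the function decode_model plays the role of d(X, s) in (7), with the n-bit string spelling out the labels y_1, ..., y_n and the adjustment part of s omitted.

    # A model is "decoded" from the inputs X = (x_1, ..., x_n) plus an n-bit
    # string of labels, by rerunning a deterministic training algorithm.
    import numpy as np

    def train_nearest_mean(X, y):
        """Deterministic trainer: classify by the closer of the two class means."""
        mu_pos = X[y == 1].mean(axis=0)
        mu_neg = X[y == -1].mean(axis=0)
        def classifier(x):
            x = np.atleast_2d(x)
            d_pos = np.linalg.norm(x - mu_pos, axis=1)
            d_neg = np.linalg.norm(x - mu_neg, axis=1)
            return np.where(d_pos < d_neg, 1, -1)
        return classifier

    def decode_model(X, bits):
        """d(X, s) of Eq. (7): the n-bit string s encodes the labels, X supplies the inputs."""
        y = np.array([1 if b == "1" else -1 for b in bits])
        return train_nearest_mean(X, y)

    if __name__ == "__main__":
        X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
        M = decode_model(X, "0011")          # 4 bits suffice here
        print(M(np.array([[0.05, 0.1], [0.9, 1.0]])))

Any model reachable this way costs roughly n bits plus the adjustment string once X is known, which is the source of the bound P(M | X) \ge K \cdot 2^{-n} used in the argument above.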
5. Conclusion

We have proposed a MAP scheme for transduction and analyzed the ideal MAP transduction as well as two approximations. The ideal MAP transduction is essentially a minimization of the stochastic complexity; the two approximations estimate one model and multiple models, respectively, as well as Y^*. Transduction for both pattern classification and regression has been addressed. For classification, MAP transduction may outperform induction because of a) the type mismatch between y and M(x), and b) the fact that for some coding systems P(M | X, X^*) differs from P(M | X) (although this is not guaranteed with respect to a universal coding system). For regression, if we estimate with respect to a coding system for which P(M | X, X^*) = P(M | X), MAP transduction is equivalent to induction when only one model is estimated, but hardly so when multiple models are estimated simultaneously.

There are two open problems with MAP transduction. a) Computation: even in the simplest case, classification with one model, the MAP estimation is not easy to implement, and approximation methods have to be considered; furthermore, P(M | X, X^*) cannot be computed directly, and how to incorporate this information into the transduction is a problem. b) Optimality: MAP is sometimes not robust to noisy data, and how to improve MAP transduction in noisy environments remains to be studied.

Acknowledgements

This work is supported by the National Natural Science Foundation of China, grant 6075004.

References

[1] Vapnik, V.: Statistical Learning Theory. Wiley Inter-science (1998)
[2] Joachims, T.: Transductive Inference for Text Classification Using Support Vector Machines. In: International Conference on Machine Learning (ICML) (1999) 200-209
[3] Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore (1989)

[4] Blum, A., Mitchell, T.: Combining Labeled and Unlabeled Data with Co-Training. In: Proceedings of the 1998 Conference on Computational Learning Theory (COLT) (1998)
[5] Nigam, K., et al.: Text Classification from Labeled and Unlabeled Documents Using EM. Machine Learning, Vol. 39 (2000) 103-134
[6] Graepel, T., Herbrich, R., Obermayer, K.: Bayesian Transduction. In: Advances in Neural Information Processing Systems (NIPS) (2000) 456-462