Transductive Inference for Estimating Values of Functions

Olivier Chapelle, Vladimir Vapnik, Jason Weston
AT&T Research Laboratories, Red Bank, USA
Royal Holloway, University of London, Egham, Surrey, UK
Barnhill BioInformatics.com, Savannah, Georgia, USA
{chapelle, vlad, weston}@research.att.com

Abstract

We introduce an algorithm for estimating the values of a function at a set of test points x_{\ell+1}, \ldots, x_{\ell+m}, given a set of training points (x_1, y_1), \ldots, (x_\ell, y_\ell), without estimating (as an intermediate step) the regression function. We demonstrate that this direct (transductive) way of estimating the values of the regression (or of the classification rule in pattern recognition) can be more accurate than the traditional one based on two steps: first estimating the function and then calculating the values of this function at the points of interest.

1 Introduction

Following [6] we consider a general scheme of transductive inference. Suppose there exists a function y^* = f_0(x) from which we observe measurements corrupted with noise,

((x_1, y_1), \ldots, (x_\ell, y_\ell)), \quad y_i = y_i^* + \xi_i,   (1)

and we are also given a set of test points

x_{\ell+1}, \ldots, x_{\ell+m}.   (2)

Find an algorithm A that, using both the given set of training data (1) and the given set of test data (2), selects from a set of functions \{x \mapsto f(x)\} a function

y = f(x) = f_A(x \mid x_1, y_1, \ldots, x_\ell, y_\ell, x_{\ell+1}, \ldots, x_{\ell+m})   (3)

and minimizes at the points of interest the functional

R(A) = E\left( \frac{1}{m} \sum_{i=\ell+1}^{\ell+m} \left( y_i^* - f_A(x_i \mid x_1, y_1, \ldots, x_\ell, y_\ell, x_{\ell+1}, \ldots, x_{\ell+m}) \right)^2 \right),   (4)

where the expectation is taken over x and \xi. For the training data we are given both the vector x and the value y; for the test data we are only given x.

Usually the problem of estimating the values of a function at points of interest is solved in two steps: first, in a given set of functions f(x, \alpha), \alpha \in \Lambda, one estimates the regression, i.e. the function which minimizes the functional

R(\alpha) = \int (y - f(x, \alpha))^2 \, dF(x, y)   (5)

(the inductive step), and then, using the estimated function y = f(x, \alpha_\ell), one calculates the values at the points of interest,

y_i^* = f(x_i, \alpha_\ell)   (6)

(the deductive step).

Note, however, that estimating a function is equivalent to estimating its values at a continuum of points in its domain. Therefore, by solving the regression problem using a restricted amount of information, we are looking for a more general solution than is required. In [6] it is shown that using a direct estimation method one can obtain better bounds than through the two-step procedure. In this article we develop the idea introduced in [5] for estimating the values of a function only at the given points.

The material is organized as follows. In Section 2 we consider the classical (inductive) Ridge Regression procedure and the leave-one-out technique which is used to measure the quality of its solutions. Section 3 introduces the transductive method of inference for estimating the values of a function, based on this leave-one-out technique. In Section 4 experiments which demonstrate the improvement given by transductive inference compared to inductive inference (in both regression and pattern recognition) are presented. Finally, Section 5 summarizes the results.

2 Ridge Regression and the Leave-One-Out Procedure

In order to describe our transductive method, let us first discuss the classical two-step (inductive plus deductive) procedure of Ridge Regression. Consider the set of functions linear in their parameters,

f(x, \alpha) = \sum_{i=1}^{n} \alpha_i \varphi_i(x).   (7)

To minimize the expected loss (5), where F(x, y) is unknown, we minimize the following empirical functional (the so-called Ridge Regression functional [1]):

R_{emp}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i, \alpha))^2 + \gamma \|\alpha\|^2,   (8)

where \gamma is a fixed positive constant called the regularization parameter. The minimum is given by the vector of coefficients

\alpha_\gamma = \alpha(x_1, y_1, \ldots, x_\ell, y_\ell) = (K^T K + \gamma I)^{-1} K^T Y,   (9)

where

Y = (y_1, \ldots, y_\ell)^T   (10)

and K is the \ell \times n matrix with elements

K_{ij} = \varphi_j(x_i), \quad i = 1, \ldots, \ell, \; j = 1, \ldots, n.   (11)

The problem is to choose the value of \gamma which provides small expected loss for training on a sample S_\ell = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}. For this purpose we would like to choose \gamma such that the function f_\gamma minimizing (8) also minimizes

R = \int (y^* - f_\gamma(x^* \mid S_\ell))^2 \, dF(x^*, y^*) \, dF(S_\ell).   (12)
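To make this inductive step concrete, here is a minimal NumPy sketch (ours, not the authors' code; the function names, the Gaussian basis and the toy data are illustrative assumptions) that builds the design matrix (11) and computes the coefficient vector (9).

```python
import numpy as np

def design_matrix(X, basis):
    """K_ij = phi_j(x_i), eq. (11): evaluate each basis function at each input point."""
    return np.column_stack([phi(X) for phi in basis])

def fit_ridge(K, y, gamma):
    """Eq. (9): alpha_gamma = (K^T K + gamma I)^{-1} K^T y."""
    n = K.shape[1]
    return np.linalg.solve(K.T @ K + gamma * np.eye(n), K.T @ y)

# Toy usage with Gaussian basis functions centred on the training points
# (the choice used later in the experiments of Section 4).
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(20, 1))
y_train = np.sin(3.0 * X_train[:, 0]) + 0.1 * rng.standard_normal(20)
sigma = 0.5
basis = [lambda X, c=c: np.exp(-np.sum((X - c) ** 2, axis=1) / (2.0 * sigma**2))
         for c in X_train]
K = design_matrix(X_train, basis)
alpha = fit_ridge(K, y_train, gamma=0.1)
y_fit = K @ alpha  # fitted values f(x_i, alpha_gamma) on the training set
```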

Since F(x, y) is unknown, one cannot estimate this minimum directly. To solve this problem we instead use the leave-one-out procedure, which is an almost unbiased estimator of (12). The leave-one-out error of an algorithm on the training sample S_\ell is

T_{loo}(\gamma) = \frac{1}{\ell} \sum_{i=1}^{\ell} \left( y_i - f_\gamma(x_i \mid S_\ell \setminus \{(x_i, y_i)\}) \right)^2,   (13)

where f_\gamma(\cdot \mid S_\ell \setminus \{(x_i, y_i)\}) denotes the function obtained by minimizing (8) on the training data with the pair (x_i, y_i) removed. The leave-one-out procedure consists of removing from the training data one element (say (x_i, y_i)), constructing the regression function only on the basis of the remaining training data, and then testing on the removed element. In this fashion one tests all ℓ elements of the training data using ℓ different decision rules. The minimum over \gamma of (13) is taken as the minimum over \gamma of (12), since the expectation of (13) coincides with (12) [2].

For Ridge Regression one can derive a closed form expression for the leave-one-out error. Denoting

A_\gamma = K^T K + \gamma I,   (14)

the error incurred by the leave-one-out procedure is [6]

T_{loo}(\gamma) = \frac{1}{\ell} \sum_{i=1}^{\ell} \left( \frac{y_i - k_i^T A_\gamma^{-1} K^T Y}{1 - k_i^T A_\gamma^{-1} k_i} \right)^2,   (15)

where

k_i = (\varphi_1(x_i), \ldots, \varphi_n(x_i))^T.   (16)

Let \gamma_0 be the minimizer of (15). Then the vector

Y^0 = K^* (K^T K + \gamma_0 I)^{-1} K^T Y,   (17)

where K^* is the m \times n matrix whose rows are \varphi(x_{\ell+1})^T, \ldots, \varphi(x_{\ell+m})^T, i.e.

K^*_{ij} = \varphi_j(x_{\ell+i}),   (18)

is the Ridge Regression estimate of the unknown values (y^*_{\ell+1}, \ldots, y^*_{\ell+m}).

3 Leave-One-Out Error for Transductive Inference

In transductive inference, our goal is to find an algorithm A which minimizes the functional (4) using both the training data (1) and the test data (2). We suggest the following method: predict (y^*_{\ell+1}, \ldots, y^*_{\ell+m}) by finding those values which minimize the leave-one-out error of Ridge Regression trained on the joint set

(x_1, y_1), \ldots, (x_\ell, y_\ell), (x_{\ell+1}, y^*_{\ell+1}), \ldots, (x_{\ell+m}, y^*_{\ell+m}).   (19)

This is achieved in the following way. We treat the unknown values (y^*_{\ell+1}, \ldots, y^*_{\ell+m}) as variables, and for some fixed value of these variables we minimize the empirical functional

R_{emp}(\alpha \mid y^*_{\ell+1}, \ldots, y^*_{\ell+m}) = \frac{1}{\ell + m} \left( \sum_{i=1}^{\ell} (y_i - f(x_i, \alpha))^2 + \sum_{i=\ell+1}^{\ell+m} (y_i^* - f(x_i, \alpha))^2 \right) + \gamma \|\alpha\|^2.   (20)

This functional differs from (8) only in its second term, and corresponds to performing Ridge Regression with the extra pairs

(x_{\ell+1}, y^*_{\ell+1}), \ldots, (x_{\ell+m}, y^*_{\ell+m}).   (21)
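The closed form (15) can be evaluated without refitting ℓ separate models. The following sketch (again an illustration with our own naming, applicable to any design matrix) selects \gamma_0 by minimizing (15) over a grid and then forms the estimate Y^0 of (17) at the test points.

```python
import numpy as np

def loo_error(K, y, gamma):
    """Closed-form leave-one-out error of eq. (15)."""
    n = K.shape[1]
    A_inv = np.linalg.inv(K.T @ K + gamma * np.eye(n))  # A_gamma^{-1}
    H = K @ A_inv @ K.T                                 # H_ij = k_i^T A^{-1} k_j
    residuals = y - H @ y                               # y_i - k_i^T A^{-1} K^T y
    return np.mean((residuals / (1.0 - np.diag(H))) ** 2)

def ridge_test_estimate(K, K_star, y, gamma):
    """Eq. (17): Y0 = K* (K^T K + gamma I)^{-1} K^T y."""
    n = K.shape[1]
    return K_star @ np.linalg.solve(K.T @ K + gamma * np.eye(n), K.T @ y)

# Toy usage with an arbitrary design matrix over l training and m test points.
rng = np.random.default_rng(0)
l, m, n = 50, 10, 8
K, K_star = rng.standard_normal((l, n)), rng.standard_normal((m, n))
y = rng.standard_normal(l)
gammas = 10.0 ** np.arange(-4.0, 2.0)
gamma_0 = min(gammas, key=lambda g: loo_error(K, y, g))  # gamma_0 minimises (15)
Y0 = ridge_test_estimate(K, K_star, y, gamma_0)
```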

Suppose that the vector Y^* = (y^*_{\ell+1}, \ldots, y^*_{\ell+m})^T is taken from some set Y^* \in \mathcal{Y} such that the pairs (21) can be considered as a sample drawn from the same distribution as the pairs (x_1, y_1^*), \ldots, (x_\ell, y_\ell^*). In this case the leave-one-out error of minimizing (20) over the set (19) approximates the functional (4). We can measure this leave-one-out error using the same technique as in Ridge Regression. Using the closed form (15) one obtains

T_{loo}(\gamma \mid y^*_{\ell+1}, \ldots, y^*_{\ell+m}) = \frac{1}{\ell + m} \sum_{i=1}^{\ell+m} \left( \frac{\tilde{Y}_i - \tilde{k}_i^T \tilde{A}_\gamma^{-1} \tilde{K}^T \tilde{Y}}{1 - \tilde{k}_i^T \tilde{A}_\gamma^{-1} \tilde{k}_i} \right)^2,   (22)

where we denote \tilde{X} = (x_1, \ldots, x_{\ell+m}), \tilde{Y} = (y_1, \ldots, y_\ell, y^*_{\ell+1}, \ldots, y^*_{\ell+m})^T and

\tilde{K}_{ij} = \varphi_j(x_i), \quad i = 1, \ldots, \ell+m, \; j = 1, \ldots, n,   (23)

\tilde{k}_i = (\varphi_1(x_i), \ldots, \varphi_n(x_i))^T,   (24)

\tilde{A}_\gamma = \tilde{K}^T \tilde{K} + \gamma I.   (25)

Now let us rewrite the expression (22) in an equivalent form that separates the terms involving \tilde{Y} from the terms involving only \tilde{X}. Introducing the matrix

C = I - \tilde{K} \tilde{A}_\gamma^{-1} \tilde{K}^T   (26)

and the matrix M with elements

M_{ij} = \sum_{k=1}^{\ell+m} \frac{C_{ik} C_{kj}}{C_{kk}^2},   (27)

we obtain the equivalent expression of (22),

T_{loo}(\gamma \mid y^*_{\ell+1}, \ldots, y^*_{\ell+m}) = \frac{1}{\ell + m} \tilde{Y}^T M \tilde{Y}.   (28)

In order for the Y^* which minimizes the leave-one-out procedure to be valid, it is required that the pairs (21) are drawn from the same distribution as the pairs (x_1, y_1^*), \ldots, (x_\ell, y_\ell^*). To satisfy this constraint we choose vectors Y^* from the set

\mathcal{Y} = \{ Y^* : \|Y^* - Y^0\| \le R \},   (29)

where the vector Y^0 is the solution (17) obtained from classical Ridge Regression. To minimize (28) under the constraint (29) we use the functional

T(Y^*) = \tilde{Y}^T M \tilde{Y} + \gamma^* \|Y^* - Y^0\|^2,   (30)

where \gamma^* is a constant depending on R. Now, to find the values at the given points of interest (2), all that remains is to find the minimum of (30) in Y^*. Note that the matrix M is obtained using only the vectors \tilde{X}. Therefore, writing \tilde{Y}^T = (Y^T, Y^{*T}), we rewrite (30) as

T(Y^*) = Y^T M_0 Y + 2 Y^{*T} M_1 Y + Y^{*T} M_2 Y^* + \gamma^* \|Y^* - Y^0\|^2,   (31)

where M is partitioned into blocks as

M = \begin{pmatrix} M_0 & M_1^T \\ M_1 & M_2 \end{pmatrix},   (32)

and M_0 is an \ell \times \ell matrix, M_1 is an m \times \ell matrix and M_2 is an m \times m matrix.
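To make the algebra above concrete, the sketch below (our own illustration, not code from the paper) constructs C as in (26) and M as in (27) from a design matrix over the joint set, and numerically checks the identity (28) for one particular choice of the unknown test values.

```python
import numpy as np

def loo_matrices(K_joint, gamma):
    """C of eq. (26) and M of eq. (27) for the design matrix on the joint set."""
    lm, n = K_joint.shape                          # lm = l + m rows, n basis functions
    A_inv = np.linalg.inv(K_joint.T @ K_joint + gamma * np.eye(n))
    C = np.eye(lm) - K_joint @ A_inv @ K_joint.T   # eq. (26)
    d = np.diag(C)
    # M_ij = sum_k C_ik C_kj / C_kk^2 (C is symmetric).
    M = (C / d[:, None] ** 2).T @ C
    return C, M

# Check eq. (28): the closed-form LOO error equals Y~^T M Y~ / (l + m).
rng = np.random.default_rng(0)
l, m, n = 30, 10, 8
K_joint = rng.standard_normal((l + m, n))
Y_joint = rng.standard_normal(l + m)               # training y's followed by trial y*'s
C, M = loo_matrices(K_joint, gamma=0.1)
loo_direct = np.mean((C @ Y_joint / np.diag(C)) ** 2)
loo_quadratic = Y_joint @ M @ Y_joint / (l + m)
assert np.isclose(loo_direct, loo_quadratic)
```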

Taking the derivative of (31) with respect to Y^*, we obtain the condition for the solution,

2 M_1 Y + 2 M_2 Y^* - 2 \gamma^* Y^0 + 2 \gamma^* Y^* = 0,   (33)

which gives the predictions

Y^* = (\gamma^* I + M_2)^{-1} (\gamma^* Y^0 - M_1 Y).   (34)

In this algorithm (which we will call Transductive Regression) we have two parameters to control: \gamma and \gamma^*. The choice of \gamma can be made using the leave-one-out estimator (15) for Ridge Regression. This leaves \gamma^* as the only free parameter.
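A hedged end-to-end sketch of the resulting Transductive Regression procedure is given below. It reuses the loo_matrices helper from the previous sketch, and the synthetic data, basis choice and parameter values are our own assumptions rather than the paper's experimental setup.

```python
import numpy as np

def transductive_regression(K_train, K_test, y_train, gamma, gamma_star):
    """Predict test values via eq. (34): Y* = (gamma* I + M2)^{-1} (gamma* Y0 - M1 Y)."""
    l, m = K_train.shape[0], K_test.shape[0]
    n = K_train.shape[1]

    # Inductive Ridge Regression estimate Y0 at the test points, eqs. (9) and (17).
    alpha = np.linalg.solve(K_train.T @ K_train + gamma * np.eye(n), K_train.T @ y_train)
    Y0 = K_test @ alpha

    # Matrix M of eq. (27), computed on the joint (training + test) design matrix.
    K_joint = np.vstack([K_train, K_test])
    _, M = loo_matrices(K_joint, gamma)        # helper from the previous sketch

    # Block partition (32): M1 is the m x l block, M2 the m x m block.
    M1 = M[l:, :l]
    M2 = M[l:, l:]
    return np.linalg.solve(gamma_star * np.eye(m) + M2, gamma_star * Y0 - M1 @ y_train)

# Usage on a toy 1-D problem with Gaussian basis functions centred on all l+m points.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(40, 1))
X_test = rng.uniform(-1.0, 1.0, size=(20, 1))
y_train = np.sin(3.0 * X_train[:, 0]) + 0.1 * rng.standard_normal(40)
centres, sigma = np.vstack([X_train, X_test]), 0.5

def rbf(X):  # K_ij = phi_j(x_i) with phi_j centred at the j-th point of the joint set
    return np.exp(-((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1) / (2.0 * sigma**2))

y_star = transductive_regression(rbf(X_train), rbf(X_test), y_train,
                                 gamma=0.1, gamma_star=1.0)
```

Note that, because (30) is quadratic in Y^*, the minimization reduces to the single m x m linear system (34); no iterative search over Y^* is needed.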

4 Experiments

To compare our one-step transductive approach with the classical two-step approach, we performed a series of experiments on regression problems. We also describe experiments applying our technique to the problem of pattern recognition.

4.1 Regression

We conducted computer simulations for the regression problem using two datasets from the DELVE repository: boston and kin-32fh.

The boston dataset is a well-known problem where one is required to estimate house prices according to various statistics based on 13 locational, economic and structural features, from data collected by the U.S. Census Service in the Boston, Massachusetts area.

The kin-32fh dataset is a realistic simulation of the forward dynamics of an 8-link all-revolute robot arm. The task is to predict the distance of the end-effector from a target, given 32 inputs which contain information on the joint positions, twist angles, and so forth.

Both problems are nonlinear and contain noisy data. Our objective is to compare our transductive inference method directly with the inductive method of Ridge Regression. To do this we chose the set of basis functions

\varphi_i(x) = \exp(-\|x - x_i\|^2 / 2\sigma^2), \quad i = 1, \ldots, \ell,

and found the values of \gamma and \sigma for Ridge Regression which minimized the leave-one-out bound (15). We then used the same values of these parameters in our transductive approach, where, using the basis functions

\varphi_i(x) = \exp(-\|x - x_i\|^2 / 2\sigma^2), \quad i = 1, \ldots, \ell + m,

we chose a fixed value of \gamma^*.

For the boston dataset we followed the same experimental setup as in [4]: we partitioned the full set of 506 observations randomly 100 times into a training set of 481 observations and a testing set of 25 observations. We chose the values of \gamma and \sigma by taking the minimum average leave-one-out error over five additional random splits of the data, stepping over the parameter space. The minimum was found at \gamma = 0.005 and \log \sigma = 0.7. For our transductive method we also chose the parameter \gamma^* = 10.

In Figure 1a we plot the mean squared error (MSE) on the test set, averaged over the 100 runs, against \log \sigma for Ridge Regression and Transductive Regression. Transductive Regression outperforms Ridge Regression, especially at the minimum.

To observe the influence of the number of test points m on the generalization ability of our transductive method, we ran further experiments, setting \gamma^* = \ell / 2m for different values of m.

[Figure 1: A comparison of Transductive Regression to Ridge Regression on the boston dataset: (a) error rates for varying \sigma, (b) varying the test set size m; and on the kin-32fh dataset: (c) error rates for varying \sigma, (d) varying the test set size.]

In Figure 1b we plot MSE on the testing set against m at \log \sigma = 0.7. The results indicate that increasing the test set size improves the performance of Transductive Regression. For Ridge Regression, of course, the size of the testing set has no influence on the generalization ability.

We then performed similar experiments on the kin-32fh dataset. This time, as we were interested in large testing sets giving improved performance for Transductive Regression, we chose 100 splits in which we took a subset of only 64 observations for training and 256 for testing. Again the leave-one-out estimator was used to find the values \gamma = 0.1 and \log \sigma = 2 for Ridge Regression, and for Transductive Regression we also chose the parameter \gamma^* = 0.1. We plotted MSE on the testing set against \log \sigma (Figure 1c), and against the size of the test set m for \log \sigma = 2.75 (with \gamma^* = 50/m) (Figure 1d), for the two algorithms. For large test set sizes our method outperforms Ridge Regression.

4.2 Pattern Recognition

This technique can also be applied to pattern recognition problems by solving them via minimization of the functional (8) with targets y = \pm 1. Such a technique is known as a Linear Discriminant (LD) technique.
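As a rough illustration of the LD formulation (not the authors' experimental code), the sketch below minimizes the Ridge Regression functional (8) over affine functions with ±1 targets and classifies by the sign of the real-valued output; the transductive variant (TLD) would instead threshold the transductive predictions (34) at zero. The data and parameter values here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two Gaussian classes with labels encoded as +1 / -1, as in the LD formulation.
X_pos = rng.normal(loc=+1.0, size=(30, 2))
X_neg = rng.normal(loc=-1.0, size=(30, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(30), -np.ones(30)])

# Minimise the Ridge Regression functional (8) with y = +/-1 over affine functions:
# augment the inputs with a constant feature and solve (K^T K + gamma I) alpha = K^T y.
K = np.hstack([X, np.ones((X.shape[0], 1))])
gamma = 0.1
alpha = np.linalg.solve(K.T @ K + gamma * np.eye(K.shape[1]), K.T @ y)

X_test = rng.normal(size=(10, 2))
scores = np.hstack([X_test, np.ones((10, 1))]) @ alpha
labels = np.where(scores >= 0.0, 1, -1)  # classify by the sign of the real-valued output
```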

Table 1: Comparison of percentage test error of AdaBoost (AB), Regularized AdaBoost (ABR), Support Vector Machines (SVM) and Transductive Linear Discrimination (TLD) on seven datasets.

                    AB     ABR    SVM    TLD
    Postal          -      -      5.5    4.7
    Banana          12.3   10.9   11.5   11.4
    Diabetes        26.5   23.8   23.5   23.3
    Titanic         22.6   22.6   22.4   22.4
    Breast Cancer   30.4   26.5   26.0   25.7
    Heart           20.3   16.6   16.0   15.7
    Thyroid         4.4    4.6    4.8    4.0

Table 1 reports the results of classification experiments on the following problems: a two-class digit recognition task (digits 0-4 versus 5-9), splitting the training set into 23 runs of 317 observations with a testing set of 2000 observations, and six problems from the UCI database. We followed the same experimental setup as in [3]: the performance of a classifier is measured by its average error over one hundred partitions of the datasets into training and testing sets. The free parameter(s) are chosen via validation on the first five training sets.

The performance of the transductive LD technique was compared to Support Vector Machines, AdaBoost and Regularized AdaBoost [3]. It is interesting to note that, although the LD technique is one of the simplest pattern recognition techniques, transductive inference based upon this method performs well compared to state-of-the-art methods of pattern recognition.

5 Summary

In this article we performed transductive inference for the problem of estimating the values of functions at given points of interest. We demonstrated that estimating the unknown values via a one-step (transductive) procedure can be more accurate than the traditional two-step (inductive plus deductive) one.

References

[1] A. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.

[2] A. Luntz and V. Brailovsky. On the estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 1969. [In Russian].

[3] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Technical Report TR-98-21, Royal Holloway, University of London, 1998.

[4] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning, pages 515-521. Morgan Kaufmann, 1998.

[5] V. Vapnik. Estimation of values of regression at the point of interest. In Method of Pattern Recognition. Sovetskoe Radio, 1977. [In Russian].

[6] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York, 1982.