Robust Minimax Probability Machine Regression

Thomas R. Strohmann (strohman@cs.colorado.edu) and Gregory Z. Grudic (grudic@cs.colorado.edu)
Department of Computer Science, University of Colorado, Boulder, CO, USA

Abstract

We formulate regression as maximizing the minimum probability (Ω) that the true regression function is within ±ε of the regression model. Our framework starts by posing regression as a binary classification problem, such that a solution to this single classification problem directly solves the original regression problem. Minimax probability machine classification (Lanckriet et al., 2002a) is used to solve the binary classification problem, resulting in a direct bound on the minimum probability Ω that the true regression function is within ±ε of the regression model. This minimax probability machine regression (MPMR) model assumes only that the mean and covariance matrix of the distribution that generated the regression data are known; no further assumptions on conditional distributions are required. Theory is formulated for determining when estimates of mean and covariance are accurate, thus implying robust estimates of the probability bound Ω. Conditions under which the MPMR regression surface is identical to a standard least squares regression surface are given, allowing direct generalization bounds to be easily calculated for any least squares regression model (which was previously possible only under very specific, often unrealistic, distributional assumptions). We further generalize these theoretical bounds to any superposition of basis functions regression model. Experimental evidence is given supporting these theoretical results.

Keywords: Minimax Probability Machine, Regression, Robust, Kernel Methods, Bounds, Distribution Free

1. Introduction

In this paper we formulate the regression problem as one of maximizing the probability, denoted by Ω, that the predicted output will be within some margin ε of the true regression function. (Throughout this paper, the phenomenon that generated the training data is referred to as the true regression function or surface, while the model that is constructed from the training data is referred to as the regression model.) We show how to compute a direct estimate of this probability for any given margin ε > 0. Following the idea of the minimax probability machine for classification (MPMC) due to Lanckriet et al. (2002a), we make no detailed distributional assumptions, but obtain a probability Ω that is a lower bound for all possible distributions with a known mean and covariance matrix. In other words, we maximize the minimum probability of our regression model being ±ε correct on test data for all possible distributions that

have the same mean and covariance matrix as the distribution that generated the training data. We term this type of regression model a minimax probability machine regression (MPMR) model (see Strohmann and Grudic, 2003).

Current practice for estimating how good a regression model is dictates that one either estimate the underlying distribution of the data or make Gaussian assumptions, both of which have their problems. Estimating high dimensional distributions from small samples is a very difficult problem that has not been solved satisfactorily (see Raudys and Jain, 1991). Gaussian processes obtain regression models within a Bayesian framework (see Williams and Rasmussen, 1995; Rasmussen, 1996; Gibbs, 1997a; MacKay, 1997). As the name indicates, they put a prior on the function space that is a generalization of a Gaussian distribution over (a finite number of) random variables. Instead of a mean and covariance, Gaussian processes use a mean function and a covariance function to express priors on the function space. By adopting these restrictions it is possible to obtain confidence intervals for a given regression estimate. In practice, however, the assumption that data is generated from underlying Gaussians does not always hold, which can make the Gaussian process confidence intervals unrealistic. Similar problems with unrealistic bounds occur in algorithms such as the relevance vector machine (Tipping, 2000), where potentially unrealistic approximations are made in obtaining them. In contrast, as long as one can obtain accurate estimates of the mean and covariance matrix, our distribution free algorithm will always yield a correct lower bound estimate, no matter what the underlying distribution is.

Because the accuracy of the proposed regression algorithm directly depends on the accuracy of the mean and covariance estimates, we propose a theoretical framework for determining when these estimates are accurate. This in turn determines when the estimate of the bound Ω is accurate. The general idea is to randomly discard a (small) subset of the training data and measure how sensitive the estimate of Ω is when we use the remaining data to estimate the mean and covariance matrix. This approach has some resemblance to the RANSAC (RANdom SAmple Consensus) method, originally developed by Fischler and Bolles (1981) for the task of image analysis.

To solve the regression problem, we reduce it to a binary classification problem by generating two classes that are obtained by shifting the dependent variable by ±ε (see Figure 1). The regression surface is then interpreted as the boundary that separates the two classes. Although any classifier can be used to solve the regression problem in this way, we propose to use MPMC (Lanckriet et al., 2002a), which gives a direct bound on Ω. One result of this paper is a theorem stating that when MPMC is used to solve the regression problem as described above, the regression model is identical to the one obtained by standard least squares regression. The importance of this result is that it is now easy to calculate a generalization bound for the least squares method. This was previously only possible under strict, often unrealistic, distributional assumptions.

The paper is organized as follows: In Section 2 we give a strict mathematical definition of our approach to the regression problem and describe our hypothesis space of functions for linear and nonlinear regression surfaces.
Section 3 outlines the statistical foundation of the minimax probability machine, based on a theorem due to Marshall and Olkin (1960) that was later extended by Popescu and Bertsimas (2001). We then show how any regression problem can be formulated as a binary classification problem by creating two symmetric classes: one has the output variable shifted by +ε, the other has it shifted by −ε (see Figure 1).

[Figure 1: Formulating regression as binary classification. (a) The regression problem. (b) The artificial classification problem: classes u and v are obtained by shifting the outputs by ±ε, and the classification boundary is the regression surface.]

Following this scheme we conclude the section by deriving the linear MPMR from the linear MPMC and demonstrating that, due to the symmetry, it is computationally easier to find an MPMR model than to find an MPMC classifier. In Section 4 we describe how to obtain nonlinear regression surfaces by using any set of nonlinear basis functions, and then focus on using nonlinear kernel maps as a distance metric on pairs of input vectors. The problem of overfitting the training data with highly sensitive nonlinear kernel maps is addressed in Section 5. There we propose a robustness measure that reflects how sensitive our estimate of Ω is with respect to the estimates of mean and covariance (which are the only estimates MPMR relies upon). Section 6 contains the experiments we conducted with our learning algorithm on toy problems and real world data sets. We start out with a simple toy example and show that for certain distributions the Gaussian process error bounds fail while the ones obtained by MPMR still hold. For the real world data sets we focused on investigating the accuracy of Ω as a lower bound probability on ±ε-accurate predictions and its relation to our robustness measure. The last section summarizes the main contributions of this paper and points out some open research questions concerning the MPMR framework. A Matlab implementation of the MPMR algorithm can be obtained from: strohman/papers.html

2. Regression Model

Formally, MPMR addresses the following problem: Let $f: \mathbb{R}^d \to \mathbb{R}$ be some unknown regression function (i.e. the phenomenon that generated the learning data), let the random vector $\mathbf{x} \in \mathbb{R}^d$ be generated from some bounded distribution that has mean $\bar{\mathbf{x}}$ and covariance $\Sigma_x$ (which we will write from now on as $\mathbf{x} \sim (\bar{\mathbf{x}}, \Sigma_x)$), and let our training data be generated according to:

$$y = f(\mathbf{x}) + \rho$$

where ρ is a noise term with expected value $E[\rho] = \bar{\rho} = 0$ and some finite variance $\mathrm{Var}[\rho] = \sigma_\rho^2$. Given a hypothesis space $H$ of functions from $\mathbb{R}^d$ to $\mathbb{R}$, we want to find a model $\hat{f} \in H$ that maximizes the minimum probability of being ±ε accurate. We define:

$$\Omega_f = \inf_{(\mathbf{x},y)\sim(\bar{\mathbf{x}},\bar{y},\Sigma)} \Pr\{|f(\mathbf{x}) - y| < \epsilon\} \qquad (1)$$

where $\Sigma = \mathrm{cov}(y, x_1, \dots, x_d) \in \mathbb{R}^{(d+1)\times(d+1)}$ is the complete covariance matrix of the random variable $y = f(\mathbf{x}) + \rho$ and the random vector $\mathbf{x}$ (see equation (12) for details). The only assumptions we make about the first and second moments of the underlying distribution are that $\bar{\mathbf{x}}$, $\bar{y}$ and $\Sigma$ are finite. No other distributional assumptions are made. For any function $f \in H$, the model $\hat{f}$ that we are looking for has to satisfy:

$$\Omega_{\hat{f}} \ge \Omega_f \quad \text{for all } f \in H$$

For linear models, $H$ contains all functions that are linear combinations of the input vector:

$$f(\mathbf{x}) = \sum_{j=1}^{d} \beta_j x_j + \beta_0 = \boldsymbol{\beta}^{\top}\mathbf{x} + \beta_0 \qquad (\text{where } \boldsymbol{\beta} \in \mathbb{R}^d,\ \beta_0 \in \mathbb{R}) \qquad (2)$$

For nonlinear models we consider the general basis function formulation:

$$f(\mathbf{x}) = \sum_{i=1}^{k} \beta_i \Phi_i(\mathbf{x}) + \beta_0 \qquad (3)$$

In this paper we focus on the commonly used kernel representation $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ (see e.g. Smola and Schölkopf, 1998), where the hypothesis space consists of all linear combinations of kernel functions with the training inputs as their first arguments (i.e. $\Phi_i(\mathbf{x}) = K(\mathbf{x}_i, \mathbf{x})$):

$$f(\mathbf{x}) = \sum_{i=1}^{N} \beta_i K(\mathbf{x}_i, \mathbf{x}) + \beta_0$$

where $\mathbf{x}_i \in \mathbb{R}^d$ denotes the $i$-th input vector of the training set $\Gamma = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$. In the nonlinear formulation we consider the distribution of the random vector of basis functions $\Phi = (\Phi_1, \Phi_2, \dots, \Phi_N)$ to formulate the MPMR bound Ω:

$$\Omega_f = \inf_{(\Phi,y)\sim(\bar{\Phi},\bar{y},\Sigma)} \Pr\{|f(\mathbf{x}) - y| < \epsilon\}$$

where $\Sigma = \mathrm{cov}(y, \Phi_1, \Phi_2, \dots, \Phi_N)$; see equation (12) for details on calculating Σ.

3. Linear Framework

In this section we develop the linear MPMR by reducing the regression problem to a single (specially structured) binary classification problem. We also show that the regression estimator of the linear MPMR is equivalent to the one obtained by the method of least squares. The terms random vector and class will be used interchangeably, in the sense that all examples of a class are assumed to be generated from its associated random vector.

3.1 Minimax Probability Machine Classification (MPMC)

The minimax binary classifier for linear decision boundaries can be formulated as a hyperplane that maximizes the minimum probability of correctly classifying data points generated from the two underlying random vectors (Lanckriet et al., 2002a,b). Assume we have random vectors $\mathbf{u}$ and $\mathbf{v}$ that are drawn from two probability distributions characterized by $(\bar{u}, \Sigma_u)$ and $(\bar{v}, \Sigma_v)$; the linear MPMC algorithm solves the following optimization problem:

$$\max_{\Omega,\,a \ne 0,\,b} \Omega \quad \text{s.t.} \quad \inf_{\mathbf{u}\sim(\bar{u},\Sigma_u)} \Pr\{a^{\top}\mathbf{u} \ge b\} \ge \Omega, \qquad \inf_{\mathbf{v}\sim(\bar{v},\Sigma_v)} \Pr\{a^{\top}\mathbf{v} \le b\} \ge \Omega \qquad (4)$$

A theorem due to Marshall and Olkin (1960), later extended by Popescu and Bertsimas (2001), gives a formula for the supremum probability that a random vector lies on one side of a hyperplane:

$$\sup_{\mathbf{v}\sim(\bar{v},\Sigma_v)} \Pr\{a^{\top}\mathbf{v} \ge b\} = \frac{1}{1+\delta^2}, \quad \text{with} \quad \delta^2 = \inf_{a^{\top}\mathbf{w} \ge b} (\mathbf{w}-\bar{v})^{\top}\Sigma_v^{-1}(\mathbf{w}-\bar{v})$$

where the infimum can be computed analytically from the hyperplane parameters $a$ and $b$ and the estimates $\bar{v}$, $\Sigma_v$ (Lanckriet et al., 2002b). This result can be used to simplify the optimization problem (4) to the following one:

$$m = \min_a \Big(\sqrt{a^{\top}\Sigma_u a} + \sqrt{a^{\top}\Sigma_v a}\Big) \quad \text{s.t.} \quad a^{\top}(\bar{u}-\bar{v}) = 1 \qquad (5)$$

The offset $b$ of the hyperplane is uniquely determined for any optimal $a$:

$$b = a^{\top}\bar{u} - \frac{\sqrt{a^{\top}\Sigma_u a}}{m} = a^{\top}\bar{v} + \frac{\sqrt{a^{\top}\Sigma_v a}}{m} \qquad (6)$$

Once the minimum value $m$ is obtained for (5), the probability bound Ω can be computed as:

$$\Omega = \frac{1}{m^2 + 1}$$

3.2 Formulating Regression as Classification

In a linear regression model we want to approximate $f: \mathbb{R}^d \to \mathbb{R}$ with a linear estimator, i.e. a function of the form $\hat{y} = \hat{f}(\mathbf{x}) = \boldsymbol{\beta}^{\top}\mathbf{x} + \beta_0$. To obtain a minimax regression model, i.e. one which maximizes the minimum probability that future points are predicted ±ε correctly, we use the MPMC framework in the following way.

Each training data point $(\mathbf{x}_i, y_i)$, for $i = 1, \dots, N$, is turned into two $(d+1)$-dimensional vectors; one is labelled as class $u$ (and has the $y$ value shifted by $+\epsilon$), the other as class $v$ (and has the $y$ value shifted by $-\epsilon$; see also Figure 1):

$$u_i = (y_i + \epsilon, x_{i1}, x_{i2}, \dots, x_{id}), \qquad v_i = (y_i - \epsilon, x_{i1}, x_{i2}, \dots, x_{id}), \qquad i = 1, \dots, N \qquad (7)$$

It is interesting to note that we could use any binary classification algorithm to solve this artificial classification problem. The boundary obtained by the classifier turns directly into the regression surface one wants to estimate. For example, one could even use decision trees as the underlying classifier, determine where the decision tree changes its classification, and use this boundary as the regression output. In this paper we focus on the consequences of using MPMC as the underlying classifier for the problem defined by (7). Once the hyperplane parameters $a$ and $b$ have been determined by the MPMC, we use the classification boundary $a^{\top}z = b$ to predict the output $\hat{y}$ for a new input $\mathbf{x}$ (where $z$ is defined as $z = (\hat{y}, x_1, x_2, \dots, x_d)$):

$$\begin{aligned} & a^{\top}z = b \quad \text{(classification boundary)}\\ \iff\;& a_1\hat{y} + a_2 x_1 + a_3 x_2 + \dots + a_{d+1} x_d = b\\ \iff\;& \hat{y} = -\tfrac{a_2}{a_1}x_1 - \tfrac{a_3}{a_1}x_2 - \dots - \tfrac{a_{d+1}}{a_1}x_d + \tfrac{b}{a_1} \qquad (a_1 \ne 0)\\ \iff\;& \hat{y} = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_d x_d + \beta_0 = \boldsymbol{\beta}^{\top}\mathbf{x} + \beta_0 \quad \text{(regression function)} \end{aligned} \qquad (8)$$

The symmetric ±ε shift in the output direction has a number of useful properties. First of all, the probability bound Ω of correct classification is equivalent to the probability that future predictions will be ±ε accurate. Roughly speaking, if the hyperplane classifies a point of class $u$ correctly, then the corresponding $y$ value can be at most ε below the hyperplane. Similarly, for points of class $v$ that are classified correctly, the corresponding $y$ value can be at most ε above the hyperplane (see the next section for a rigorous proof). Second, the artificial classification problem has a much simpler structure than the general classification problem. In particular, we find that $\Sigma_u = \Sigma_v =: \Sigma$ and $\bar{u} - \bar{v} = (2\epsilon, 0, \dots, 0)^{\top}$ (this symmetry is illustrated in the sketch following (9)). This allows us to simplify (5) even further: we can minimize $2\sqrt{a^{\top}\Sigma a}$, or equivalently $a^{\top}\Sigma a$, with respect to $a$. The next section shows how to solve this minimization problem by solving just one linear system.

3.3 Linear MPMR

Given an estimate $\Sigma = \Sigma_u = \Sigma_v \in \mathbb{R}^{(d+1)\times(d+1)}$ of the covariance matrix for the random vector $(y, \mathbf{x})$ that generated the data, we can simplify (5):

$$\min_a \sqrt{a^{\top}\Sigma_u a} + \sqrt{a^{\top}\Sigma_v a}\;\; \text{s.t.}\;\; a^{\top}(\bar{u}-\bar{v}) = 1 \quad\Longleftrightarrow\quad \min_a 2\sqrt{a^{\top}\Sigma a}\;\; \text{s.t.}\;\; a^{\top}(2\epsilon, 0, \dots, 0)^{\top} = 1$$

Since we are only interested in the minimum, we can get rid of the square root and the factor of 2. The goal of the linear MPMR can then be formulated as the solution of the following constrained optimization problem for $a \in \mathbb{R}^{d+1}$:

$$\min_a a^{\top}\Sigma a \quad \text{s.t.} \quad (2\epsilon, 0, \dots, 0)\,a = 1 \qquad (9)$$
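Before turning to the solution of (9), the following minimal sketch (Python with NumPy; the variable names and the synthetic data are ours and purely illustrative, not the paper's Matlab implementation) builds the two artificial classes of (7) from a toy training set and checks numerically that $\Sigma_u = \Sigma_v$ and $\bar{u} - \bar{v} = (2\epsilon, 0, \dots, 0)^{\top}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, eps = 200, 3, 0.5

# toy training data: y = f(x) + noise
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

# construction (7): shift the output by +eps for class u and by -eps for class v
U = np.column_stack([y + eps, X])   # class u, shape (N, d+1)
V = np.column_stack([y - eps, X])   # class v, shape (N, d+1)

Sigma_u = np.cov(U, rowvar=False)   # sample covariance of class u
Sigma_v = np.cov(V, rowvar=False)   # sample covariance of class v

# the shift leaves the covariances identical and only moves the means apart
print(np.allclose(Sigma_u, Sigma_v))       # True
print(U.mean(axis=0) - V.mean(axis=0))     # approximately (2*eps, 0, ..., 0)
```

Because only the first component of the class means differs, the geometry of the artificial classification problem is controlled by the single covariance matrix Σ, which is what makes the simplification leading to (9) possible.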

We use Lagrange multipliers to turn the problem into an unconstrained optimization problem:

$$L(a, \lambda) = a^{\top}\Sigma a + \lambda\big((2\epsilon, 0, \dots, 0)\,a - 1\big)$$

By setting the derivatives $\partial L(a,\lambda)/\partial a = 0$ and $\partial L(a,\lambda)/\partial \lambda = 0$ we obtain the following system of linear equations, which describes a solution of (9):

$$2\Sigma a + (2\lambda\epsilon, 0, \dots, 0)^{\top} = \mathbf{0} \qquad (10)$$

$$\text{and} \qquad a_1 = \tfrac{1}{2\epsilon} \qquad (11)$$

where

$$\Sigma = \begin{pmatrix} \sigma_{yy} & \sigma_{y x_1} & \cdots & \sigma_{y x_d} \\ \sigma_{x_1 y} & \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \vdots & \vdots & & \vdots \\ \sigma_{x_d y} & \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix} \qquad (12)$$

is the full sample covariance matrix for the training data $\Gamma = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$. The entries of Σ are estimated as follows:

$$\bar{x}_j = \tfrac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad \bar{y} = \tfrac{1}{N}\sum_{i=1}^{N} y_i, \qquad \sigma_{yy} = \tfrac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2$$

$$\sigma_{x_j y} = \tfrac{1}{N-1}\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(y_i - \bar{y}) = \sigma_{y x_j}, \qquad \sigma_{x_j x_k} = \tfrac{1}{N-1}\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k), \qquad j, k = 1, \dots, d$$

The system (10)-(11) has $d+2$ equations and $d+2$ unknowns: $a_1, a_2, \dots, a_{d+1}$ and the Lagrange multiplier λ (note also that (11) guarantees the requirement $a_1 \ne 0$ of (8) for an arbitrary ε > 0). For any values $a_1, a_2, \dots, a_{d+1}$, the first equation of (10) can always be satisfied by setting the free variable λ appropriately: $\lambda = -(\Sigma a)_1/\epsilon$. We write out the other $d$ equations of (10) and omit the factor of 2:

$$\begin{pmatrix} \sigma_{x_1 y} & \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \sigma_{x_2 y} & \sigma_{x_2 x_1} & \cdots & \sigma_{x_2 x_d} \\ \vdots & \vdots & & \vdots \\ \sigma_{x_d y} & \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix} a = \mathbf{0}$$

Recall that we are interested in the coefficients $\beta_j$, which can be written as $\beta_j = -a_{j+1}/a_1$ for $j = 1, \dots, d$ (see (8)). Therefore we rearrange the system above in terms of the $\beta_j$ (by multiplying out the $\sigma_{x_j y}\,a_1$ terms, bringing them to the other side, and dividing by $-a_1$):

$$a_1 \begin{pmatrix} \sigma_{x_1 y} \\ \sigma_{x_2 y} \\ \vdots \\ \sigma_{x_d y} \end{pmatrix} + \begin{pmatrix} \sigma_{x_1 x_1} & \sigma_{x_1 x_2} & \cdots & \sigma_{x_1 x_d} \\ \vdots & \vdots & & \vdots \\ \sigma_{x_d x_1} & \sigma_{x_d x_2} & \cdots & \sigma_{x_d x_d} \end{pmatrix} \begin{pmatrix} a_2 \\ a_3 \\ \vdots \\ a_{d+1} \end{pmatrix} = \mathbf{0}$$

becomes

$$\begin{pmatrix} \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \vdots & & \vdots \\ \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix} \begin{pmatrix} a_2 \\ a_3 \\ \vdots \\ a_{d+1} \end{pmatrix} = -a_1 \begin{pmatrix} \sigma_{x_1 y} \\ \sigma_{x_2 y} \\ \vdots \\ \sigma_{x_d y} \end{pmatrix}$$

and finally (divide by $-a_1$ and substitute $\beta_j = -a_{j+1}/a_1$ for $j = 1, \dots, d$)

$$\begin{pmatrix} \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \vdots & & \vdots \\ \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_d \end{pmatrix} = \begin{pmatrix} \sigma_{x_1 y} \\ \sigma_{x_2 y} \\ \vdots \\ \sigma_{x_d y} \end{pmatrix} \qquad (13)$$

One can show (see Appendix A) that this is exactly the same system that appears when solving linear least squares problems. Consequently, the offset $\beta_0$ should also be equivalent to the offset $\beta_0 = \bar{y} - \boldsymbol{\beta}^{\top}\bar{\mathbf{x}}$ of the least squares solution. To see why this is true, we start by looking at the offset $b$ of the hyperplane. From (6), with $\Sigma_u = \Sigma_v = \Sigma$:

$$b = a^{\top}\bar{u} - \frac{\sqrt{a^{\top}\Sigma a}}{\sqrt{a^{\top}\Sigma a}+\sqrt{a^{\top}\Sigma a}} = \Big(\tfrac{1}{2\epsilon}, -\tfrac{\beta_1}{2\epsilon}, \dots, -\tfrac{\beta_d}{2\epsilon}\Big)(\bar{y}+\epsilon, \bar{x}_1, \dots, \bar{x}_d)^{\top} - \tfrac{1}{2} = \tfrac{1}{2\epsilon}\Big(\bar{y} + \epsilon - \sum_{i=1}^{d}\beta_i\bar{x}_i\Big) - \tfrac{1}{2} = \tfrac{1}{2\epsilon}\Big(\bar{y} - \sum_{i=1}^{d}\beta_i\bar{x}_i\Big)$$

Therefore, using $a_1 = 1/(2\epsilon)$, we have:

$$\beta_0 = \frac{b}{a_1} = 2\epsilon\, b = \bar{y} - \sum_{j=1}^{d}\beta_j\bar{x}_j = \bar{y} - \boldsymbol{\beta}^{\top}\bar{\mathbf{x}} \qquad (14)$$

From (13) and (14) we obtain our first theorem:

Theorem 1. The linear MPMR algorithm computes the same parameters $\beta_0, \beta_1, \beta_2, \dots, \beta_d$ as the method of least squares.

Proof. Derivation of (13) and (14).

We have shown that for the linear case the regression estimator obtained from MPMR is the same one gets from least squares. The MPMR approach, however, also allows us to compute a distribution free lower bound Ω on the probability that future data points are predicted ±ε correctly. We do not know of any other formulation that yields such a bound for the commonly used least squares regression. To actually compute the bound Ω, we estimate the covariance matrix Σ from the training data and solve (13) for $\boldsymbol{\beta}$. Then, for any given margin ε > 0 we set $a_1 = 1/(2\epsilon)$, $a_{j+1} = -\beta_j a_1$ ($j = 1, \dots, d$) and calculate:

$$\Omega = \frac{1}{4\,a^{\top}\Sigma a + 1} \qquad (15)$$

The following theorem establishes that the Ω calculated in (15) is a lower bound on the probability of future predictions being ±ε accurate (if we have perfect estimates of $\bar{\mathbf{x}}$, $\bar{y}$, Σ).

Theorem 2. For any distributions $\Lambda_x$ with $E[\Lambda_x] = \bar{\mathbf{x}}$, $V[\Lambda_x] = \Sigma_x$ and $\Lambda_\rho$ with $E[\Lambda_\rho] = 0$, $V[\Lambda_\rho] = \sigma_\rho^2$ (where $\bar{\mathbf{x}}$, $\Sigma_x$, $\sigma_\rho^2$ are finite), and $\mathbf{x} = (x_1, \dots, x_d) \sim \Lambda_x$, $\rho \sim \Lambda_\rho$, let $y = f(\mathbf{x}) + \rho$ be the (noisy) output of the true regression function. Assume perfect knowledge of the statistics $\bar{\mathbf{x}}$, $\bar{y}$, Σ (where Σ is defined in (12)). Then the MPMR model $\hat{f}: \mathbb{R}^d \to \mathbb{R}$ satisfies:

$$\inf_{(\mathbf{x},y)\sim(\bar{\mathbf{x}},\bar{y},\Sigma)} \Pr\{|\hat{f}(\mathbf{x}) - y| < \epsilon\} = \Omega$$

Proof. For any $(d+1)$-dimensional vector $z = (y, x_1, \dots, x_d)$ we have:

$z$ is classified as class $u$ $\iff$ $y > \hat{f}(\mathbf{x})$, and $z$ is classified as class $v$ $\iff$ $y < \hat{f}(\mathbf{x})$

(this follows directly from the construction of the $(d+1)$-dimensional hyperplane). Now look at the random variables $z_{+\epsilon} = (y + \epsilon, \mathbf{x})$ and $z_{-\epsilon} = (y - \epsilon, \mathbf{x})$. The distribution of $z_{+\epsilon}$ satisfies $z_{+\epsilon} \sim ((\bar{y} + \epsilon, \bar{\mathbf{x}}), \Sigma)$, which has the same mean and covariance matrix as our (artificial) class $u \sim (\bar{u}, \Sigma_u)$. Similarly, we find that $z_{-\epsilon} \sim (\bar{v}, \Sigma_v)$. The MPMC classifier guarantees that Ω is a lower bound on correct classification. Formally, correct classification is the probabilistic event that $z_{+\epsilon}$ is classified as $u$ and that at the same time $z_{-\epsilon}$ is classified as $v$. If we use the two equivalences above we obtain:

$$\Pr\{z_{+\epsilon} \text{ classified as } u \wedge z_{-\epsilon} \text{ classified as } v\} = \Pr\{y + \epsilon > \hat{f}(\mathbf{x}) \wedge y - \epsilon < \hat{f}(\mathbf{x})\} = \Pr\{|\hat{f}(\mathbf{x}) - y| < \epsilon\}$$

Since Ω is the infimum probability over all possible distributions for $z_{+\epsilon}$ and $z_{-\epsilon}$, it is also the infimum probability over the equivalent distributions for $\mathbf{x}$ and $y$. This completes the proof that Ω is a lower bound probability for the MPMR model being ±ε accurate.

It is interesting to note that once we have calculated a model we can also apply (15) in the other direction, i.e. for a given probability $\Omega \in [0, 1)$ we can compute the margin ε such that Ω is a lower bound on the probability of test points being ±ε accurate. Simple algebra shows that $4\,a^{\top}\Sigma a = \big(\sigma_{yy} - \sum_{j=1}^{d}\beta_j\sigma_{y x_j}\big)/\epsilon^2$. We substitute $a_1 = 1/(2\epsilon)$ and $a_{j+1} = -\beta_j a_1$ for $j = 1, \dots, d$, and factor $1/(2\epsilon)$ out of each of the two vectors (the 4 cancels with $(1/2)^2$):

$$4\,a^{\top}\Sigma a = 4\Big(\tfrac{1}{2\epsilon}\Big)^2 (1, -\beta_1, \dots, -\beta_d)\,\Sigma\,(1, -\beta_1, \dots, -\beta_d)^{\top} = \tfrac{1}{\epsilon^2}\,(1, -\beta_1, \dots, -\beta_d)\,\Sigma\,(1, -\beta_1, \dots, -\beta_d)^{\top}$$

We do the matrix-vector multiplication on the right side and note that $\sum_{j=1}^{d} \sigma_{x_i x_j}\beta_j = \sigma_{x_i y}$ ($i = 1, \dots, d$) from (13):

$$\tfrac{1}{\epsilon^2}\,(1, -\beta_1, \dots, -\beta_d)\begin{pmatrix} \sigma_{yy} - \sum_{j=1}^{d} \sigma_{y x_j}\beta_j \\ \sigma_{x_1 y} - \sigma_{x_1 y} \\ \vdots \\ \sigma_{x_d y} - \sigma_{x_d y} \end{pmatrix}$$

The last $d$ components of the column vector are zero, hence the expression simplifies to:

$$\tfrac{1}{\epsilon^2}\Big(\sigma_{yy} - \sum_{j=1}^{d} \beta_j\sigma_{y x_j}\Big)$$

The numerator $\nu = \sigma_{yy} - \sum_{j=1}^{d} \beta_j\sigma_{y x_j}$ establishes the relation between Ω and ε. It is important to note that the lower bound Ω is also valid for linear and nonlinear (see Section 4.1) regression models generated by other learning algorithms. However, for other models $\boldsymbol{\beta}$ may not satisfy (13), i.e. the last $d$ components do not zero out, and we have to compute the whole expression:

$$\nu = (1, -\beta_1, \dots, -\beta_d)\begin{pmatrix} \sigma_{yy} & \sigma_{y x_1} & \cdots & \sigma_{y x_d} \\ \sigma_{x_1 y} & \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \vdots & \vdots & & \vdots \\ \sigma_{x_d y} & \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix}\begin{pmatrix} 1 \\ -\beta_1 \\ \vdots \\ -\beta_d \end{pmatrix} \qquad (16)$$

If we set $\beta_0$ according to (14) we can, for example, compute distribution free error bounds of an SVM for any given confidence Ω. We summarize these results in the following theorem:

Theorem 3. For any linear regression model $f(\mathbf{x}) = \sum_{j=1}^{d} \beta_j x_j + \beta_0$ we have the following relationships between the distribution free lower bound probability Ω and the margin size ε:

$$\Omega = \frac{1}{\nu/\epsilon^2 + 1}, \qquad \epsilon = \sqrt{\frac{\nu\,\Omega}{1-\Omega}}$$

where ν is defined in (16).

Proof. Combining (15) and (16) we conclude that we can write Ω as

$$\Omega = \frac{1}{4\,a^{\top}\Sigma a + 1} = \frac{1}{\nu/\epsilon^2 + 1}$$

for any basis function regression model (which completes the first part of the proof). If we solve this equation for ε we obtain:

$$\Omega = \frac{1}{\nu/\epsilon^2 + 1} \iff \Big(\frac{\nu}{\epsilon^2} + 1\Big)\Omega = 1 \iff (\nu + \epsilon^2)\,\Omega = \epsilon^2 \iff \nu\Omega = \epsilon^2(1 - \Omega) \iff \epsilon = \sqrt{\frac{\nu\,\Omega}{1-\Omega}}$$
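The linear case can be summarized in a few lines of code. The sketch below (Python/NumPy; all function and variable names are ours and purely illustrative, not the paper's Matlab implementation) solves the covariance system (13), sets the offset as in (14), and uses ν from (16) to move between Ω and ε as in Theorem 3:

```python
import numpy as np

def linear_mpmr(X, y):
    """Fit the linear MPMR model, which by Theorem 1 equals least squares."""
    S = np.cov(np.column_stack([y, X]), rowvar=False)   # (d+1)x(d+1) covariance, y first as in (12)
    Sigma_xx, sigma_xy = S[1:, 1:], S[1:, 0]
    beta = np.linalg.solve(Sigma_xx, sigma_xy)           # system (13)
    beta0 = y.mean() - X.mean(axis=0) @ beta             # offset (14)
    # nu from (16); for the MPMR / least squares solution it reduces to
    # sigma_yy - sum_j beta_j * sigma_yx_j
    v = np.concatenate([[1.0], -beta])
    nu = v @ S @ v
    return beta0, beta, nu

def omega_from_eps(nu, eps):
    """Lower bound Omega for a given margin eps (Theorem 3)."""
    return 1.0 / (nu / eps**2 + 1.0)

def eps_from_omega(nu, omega):
    """Margin eps guaranteed with probability at least omega (Theorem 3)."""
    return np.sqrt(nu * omega / (1.0 - omega))

# usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ rng.normal(size=4) + 0.3 * rng.normal(size=500)
beta0, beta, nu = linear_mpmr(X, y)
print(omega_from_eps(nu, eps=0.5), eps_from_omega(nu, omega=0.9))
```

For coefficients produced by some other algorithm, the same two conversion functions apply as long as ν is computed with the full quadratic form (16) rather than the simplified numerator.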

4. Nonlinear Framework

We now extend the above linear regression formulation to nonlinear regression using the basis function formulation:

$$f(\mathbf{x}) = \sum_{i=1}^{k} \beta_i \Phi_i(\mathbf{x}) + \beta_0$$

where the basis functions $\Phi_i: \mathbb{R}^d \to \mathbb{R}$ can be arbitrary nonlinear mappings.

4.1 Kernel Maps as Features

In this paper we focus our analysis and experiments on Mercer kernels and use the kernel maps $K(\mathbf{x}_i, \cdot)$ induced by the training examples as a basis. Instead of $d$ features, each input vector is now represented by $N$ features: the kernel map evaluated at all of the training inputs (including itself).

$$f(\mathbf{x}) = \sum_{i=1}^{N} \beta_i K(\mathbf{x}_i, \mathbf{x}) + \beta_0$$

We then solve the regression problem with the linear MPMR, i.e. we minimize the following least squares expression:

$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{N} \Big(y_i - \Big(\sum_{j=1}^{N} \beta_j z_{ij} + \beta_0\Big)\Big)^2 \qquad \text{where} \quad z_{ij} = K(\mathbf{x}_i, \mathbf{x}_j), \quad i = 1, \dots, N;\; j = 1, \dots, N$$

A new input $\mathbf{x}$ is then evaluated as:

$$\hat{y} = \hat{f}(\mathbf{x}) = \sum_{i=1}^{N} \beta_i K(\mathbf{x}_i, \mathbf{x}) + \beta_0$$

Commonly used kernel functions include the linear kernel $K(\mathbf{s}, \mathbf{t}) = \mathbf{s}^{\top}\mathbf{t}$, the polynomial kernel $K(\mathbf{s}, \mathbf{t}) = (\mathbf{s}^{\top}\mathbf{t} + 1)^p$ and the Gaussian kernel $K(\mathbf{s}, \mathbf{t}) = \exp\big(-(\mathbf{s}-\mathbf{t})^{\top}(\mathbf{s}-\mathbf{t})/\sigma\big)$ (see for example Schölkopf and Smola, 2002). Note that the learning algorithm now has to solve an N-by-N system instead of a d-by-d system:

$$\begin{pmatrix} \sigma_{z_1 z_1} & \sigma_{z_1 z_2} & \cdots & \sigma_{z_1 z_N} \\ \sigma_{z_2 z_1} & \sigma_{z_2 z_2} & \cdots & \sigma_{z_2 z_N} \\ \vdots & \vdots & & \vdots \\ \sigma_{z_N z_1} & \sigma_{z_N z_2} & \cdots & \sigma_{z_N z_N} \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_N \end{pmatrix} = \begin{pmatrix} \sigma_{z_1 y} \\ \sigma_{z_2 y} \\ \vdots \\ \sigma_{z_N y} \end{pmatrix}$$

and setting

$$\beta_0 = \bar{y} - \sum_{j=1}^{N} \beta_j \bar{z}_j$$

where

$$\bar{y} = \tfrac{1}{N}\sum_{i=1}^{N} y_i, \qquad \bar{z}_j = \tfrac{1}{N}\sum_{i=1}^{N} z_{ij}, \qquad \sigma_{z_j y} = \tfrac{1}{N-1}\sum_{i=1}^{N} (z_{ij} - \bar{z}_j)(y_i - \bar{y}) = \sigma_{y z_j}, \qquad \sigma_{z_j z_k} = \tfrac{1}{N-1}\sum_{i=1}^{N} (z_{ij} - \bar{z}_j)(z_{ik} - \bar{z}_k)$$

for $j, k = 1, \dots, N$.
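A minimal sketch of this kernelized procedure follows (Python/NumPy; names are ours, and we assume the Gaussian kernel parameterization $\exp(-\lVert\mathbf{s}-\mathbf{t}\rVert^2/\sigma)$ used in the experiments). Because the sample covariance of $N$ kernel features computed from $N$ examples has rank at most $N-1$, the sketch uses a least squares solve for the N-by-N system:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K(s, t) = exp(-||s - t||^2 / sigma), the parameterization used in the experiments."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma)

def kernel_mpmr(X, y, sigma):
    """Kernel MPMR: run the linear MPMR on the N kernel features z_ij = K(x_i, x_j)."""
    Z = gaussian_kernel(X, X, sigma)                   # N x N feature matrix
    S = np.cov(np.column_stack([y, Z]), rowvar=False)  # (N+1) x (N+1) covariance, y first
    # N-by-N system of Section 4.1; the feature covariance is rank deficient,
    # so a least squares solve is used instead of a direct solve
    beta = np.linalg.lstsq(S[1:, 1:], S[1:, 0], rcond=None)[0]
    beta0 = y.mean() - Z.mean(axis=0) @ beta
    v = np.concatenate([[1.0], -beta])
    nu = v @ S @ v                                     # nu as in Theorem 4

    def predict(X_new):
        return gaussian_kernel(X_new, X, sigma) @ beta + beta0

    return predict, nu

# usage: fit the sinc toy problem and report the bound Omega for eps = 0.2
rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=100)
predict, nu = kernel_mpmr(X, y, sigma=0.3 * X.shape[1])
print("Omega(eps=0.2) =", 1.0 / (nu / 0.2**2 + 1.0))
```

Any other set of basis functions $\Phi_i$ could be substituted for the kernel columns without changing the rest of the procedure.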

The proof of Theorem 3 generalizes in a straightforward way to the basis function framework:

Theorem 4. For any basis function regression model $f(\mathbf{x}) = \sum_{i=1}^{N} \beta_i\Phi_i(\mathbf{x}) + \beta_0$ we have the following relationships between the distribution free lower bound probability Ω and the margin size ε:

$$\Omega = \frac{1}{\nu/\epsilon^2 + 1}, \qquad \epsilon = \sqrt{\frac{\nu\,\Omega}{1-\Omega}} \qquad (17)$$

where

$$\nu = (1, -\beta_1, \dots, -\beta_N)\begin{pmatrix} \sigma_{yy} & \sigma_{y z_1} & \cdots & \sigma_{y z_N} \\ \sigma_{z_1 y} & \sigma_{z_1 z_1} & \cdots & \sigma_{z_1 z_N} \\ \vdots & \vdots & & \vdots \\ \sigma_{z_N y} & \sigma_{z_N z_1} & \cdots & \sigma_{z_N z_N} \end{pmatrix}\begin{pmatrix} 1 \\ -\beta_1 \\ \vdots \\ -\beta_N \end{pmatrix}$$

Proof. Analogous to the proof of Theorem 3.

4.2 A Different Approach

In previous work we proposed a different nonlinear MPMR formulation, which is based on the nonlinear MPM for classification by Lanckriet et al. (2002a). Instead of using kernel maps as features, this formulation finds a minimax hyperplane in a higher dimensional space. As with kernelized support vector machines, one considers the (implicit) mapping $\varphi: \mathbb{R}^d \to \mathbb{R}^h$ and its dot products $\varphi(\mathbf{x}_i)^{\top}\varphi(\mathbf{x}_j) =: K(\mathbf{x}_i, \mathbf{x}_j)$. This MPMR algorithm (we call it kernel-MPMR since it requires Mercer kernels) is also obtained by using the MPM classifier on an artificial data set where the dependent variable, shifted by ±ε, is added as the first component of the input vector (yielding the $(d+1)$-dimensional vector $(y, x_1, \dots, x_d)$). The kernel-MPMR model looks like:

$$\sum_{i=1}^{2N} \gamma_i K(\mathbf{z}_i, \mathbf{z}) + \gamma_0 = 0 \qquad (18)$$

where $\mathbf{z}_i = (y_i + \epsilon, x_{i1}, \dots, x_{id})$ for $i = 1, \dots, N$ and $\mathbf{z}_i = (y_{i-N} - \epsilon, x_{i-N,1}, \dots, x_{i-N,d})$ for $i = N+1, \dots, 2N$. Evaluating a new point $\mathbf{x}$ amounts to plugging $\mathbf{z} = (y, \mathbf{x})$ into (18) and solving the equation for $y$. For an arbitrary kernel this would imply that we have to deal with a nonlinear equation, which is at best computationally expensive and at worst not solvable at all. Therefore, the kernel is restricted to be output-linear, i.e. we require:

$$K\big((s_1, s_2, \dots, s_{d+1}), (t_1, t_2, \dots, t_{d+1})\big) = s_1 t_1 + K'\big((s_2, \dots, s_{d+1}), (t_2, \dots, t_{d+1})\big) \qquad (19)$$

The theory of positive definite kernels (see for example Schölkopf and Smola, 2002) states that positive definite kernels are closed under summation, which means that we could use

an arbitrary positive definite kernel for $K'$ and obtain a valid kernel $K$ (of course the linear output kernel is also positive definite). By using a kernel $K$ of the form (19), we can analytically solve (18) for $y$ and obtain the regression model:

$$y = -\frac{1}{2\epsilon}\left[\Big(\sum_{i=1}^{N}(\gamma_i + \gamma_{i+N})\,K'(\mathbf{x}_i, \mathbf{x})\Big) + \gamma_0\right]$$

If we compare the two MPMR algorithms, we note that the computational cost of learning a kernel-MPMR model is larger, because instead of an N×N system we have to solve a 2N×2N system to obtain the coefficients γ. For MPMR we are not restricted to Mercer kernels but can use any set $\Phi_i$ of nonlinear basis functions. It is an open question whether MPMR and kernel-MPMR are equivalent when Mercer kernels are used for the MPMR.

5. Robust Estimates

The MPMR learning algorithm estimates the covariance matrix from the training data. Of course, these statistical estimates will differ from the true values, and thus the regression surface itself and the lower bound probability Ω may be inaccurate. Especially in the nonlinear case the problem of overfitting the training data with highly sensitive kernel functions arises, so our goal is to find kernels that produce robust estimates. One way of dealing with the problem of selecting the right kernel (i.e. model) is to do cross validation on the training data and pick the kernel which yields the best result. In this paper we explore a different approach which is computationally much more efficient. The idea is to randomly select a fraction α of rows that will be left out when calculating the covariance matrix (note that we measure the covariance columnwise, i.e. deleting rows does not affect the dimension of the covariance matrix). If the covariance matrix of the remaining training data is reasonably close to the covariance matrix of all the training data, we consider our model to be robust.

We now define in mathematical terms what reasonably close means. The objective of the MPMR algorithm is to maximize the lower bound probability Ω, which is determined by the term $a^{\top}\Sigma a$. Therefore, we measure the sensitivity of an MPMR model by looking at the sensitivity of $a^{\top}\Sigma a$. We call a model α-robust if the relative error of this expression is at most α when leaving a fraction α of points out for computing the covariance matrix:

Definition 1.
$\Sigma^*$ — true covariance matrix; $a^*$ minimizes $a^{\top}\Sigma^* a$ s.t. $a_1 = 1/(2\epsilon)$
$\Sigma$ — covariance matrix of all $N$ examples; $a$ minimizes $a^{\top}\Sigma a$ s.t. $a_1 = 1/(2\epsilon)$
$\Sigma_\alpha$ — covariance matrix of $(1-\alpha)N$ examples; $a_\alpha$ minimizes $a_\alpha^{\top}\Sigma_\alpha a_\alpha$ s.t. $a_{\alpha,1} = 1/(2\epsilon)$

A model is α-robust $:\iff$

$$\frac{|a_\alpha^{\top}\Sigma^* a_\alpha - a^{\top}\Sigma^* a|}{a_\alpha^{\top}\Sigma^* a_\alpha} \le \alpha$$

We cannot apply this definition directly since we do not know the true covariance matrix $\Sigma^*$, but we will prove that under the following two assumptions we can obtain a formula that

does not involve $\Sigma^*$:

1) $a^{\top}\Sigma^* a \le a_\alpha^{\top}\Sigma^* a_\alpha$ (the model built from all $N$ examples is more accurate than the $(1-\alpha)N$ model)
2) $|a^{\top}\Sigma a - a_\alpha^{\top}\Sigma a_\alpha| \le |a^{\top}\Sigma^* a - a^{*\top}\Sigma^* a^*|$ (for small α, $a_\alpha$ is closer to $a$ than $a$ is to $a^*$)

Under these assumptions the following theorem holds:

Theorem 5. If a model is α-robust, then the ratio $\alpha : \big(|a_\alpha^{\top}\Sigma a_\alpha - a^{\top}\Sigma a| / a_\alpha^{\top}\Sigma a_\alpha\big)$ is at least 1. In formulas:

$$\frac{|a_\alpha^{\top}\Sigma^* a_\alpha - a^{\top}\Sigma^* a|}{a_\alpha^{\top}\Sigma^* a_\alpha} \le \alpha \;\Longrightarrow\; \alpha : \frac{|a_\alpha^{\top}\Sigma a_\alpha - a^{\top}\Sigma a|}{a_\alpha^{\top}\Sigma a_\alpha} \ge 1 \qquad (20)$$

Proof. Writing out assumptions 1) and 2) together with the premise and combining the resulting inequalities yields $|a_\alpha^{\top}\Sigma a_\alpha - a^{\top}\Sigma a| \le \alpha\, a_\alpha^{\top}\Sigma a_\alpha$, which is equivalent to the stated ratio being at least 1.

We refer to the number

$$R = \alpha : \frac{|a_\alpha^{\top}\Sigma a_\alpha - a^{\top}\Sigma a|}{a_\alpha^{\top}\Sigma a_\alpha}$$

as the robustness measure or confidence measure. In our experiments we apply the theorem to show that a model is not α-robust: a model for which we find $R < 1$ will be rejected because of too much sensitivity in Ω.

A different approach for constructing a robust minimax probability machine has been investigated by Lanckriet et al. (2002b) for the task of classification. There the main idea is to put error bounds on the plug-in estimates of mean and covariance. Instead of a single mean, the robust classifier considers an ellipsoid of means, centered around the estimated mean and bounded in size by an error bound parameter ν. Similarly, instead of a single covariance matrix one now has a matrix ball, where each matrix is at most ρ away (measured in Frobenius norm) from the estimated covariance matrix. When a robust model is calculated, the lower bound probability Ω (for correct classification) decreases when either ν or ρ increases. The model itself is independent of ν, but it can depend on ρ, and a small toy example (see Lanckriet et al., 2002b, p. 578) shows that a robust model (ρ > 0) can yield better performance than a simple one (where ρ = 0). From a theoretical point of view the two approaches differ in that our method is output oriented (it considers the sensitivity of Ω, which is the objective that MPMR maximizes), whereas the other method is more input oriented (it considers the sensitivity of the plug-in estimates derived from the training data). Both approaches rely on sampling methods to make robust estimates: we sample a fraction α of the examples to determine the sensitivity of Ω; Lanckriet et al. (2002b) use sampling methods to determine reasonable values for the parameters ν and ρ. Our method, however, has the advantage that it provides a straightforward way to do model selection: choose the model that has a maximal Ω while satisfying $R \ge 1$.
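A minimal sketch of the robustness computation follows (Python/NumPy). The function names are ours, and the particular placement of Σ versus $\Sigma_\alpha$ in the ratio follows our reading of Definition 1 and Theorem 5, so this should be treated as an illustration of the idea (discard a fraction α of rows, re-estimate the covariance, and compare the resulting quadratic forms) rather than as the reference implementation:

```python
import numpy as np

def quadratic_form_min(S, eps):
    """Return the a minimizing a^T S a subject to a_1 = 1/(2*eps)."""
    beta = np.linalg.lstsq(S[1:, 1:], S[1:, 0], rcond=None)[0]
    return np.concatenate([[1.0], -beta]) / (2.0 * eps)

def robustness(Z, y, eps, alpha=0.1, rng=None):
    """Confidence measure R = alpha / (|a_a^T S a_a - a^T S a| / a_a^T S a_a).

    Z holds the (possibly kernelized) features; a fraction alpha of the rows
    is discarded when estimating the second covariance matrix."""
    rng = np.random.default_rng(0) if rng is None else rng
    D = np.column_stack([y, Z])
    S = np.cov(D, rowvar=False)                  # covariance from all N examples
    keep = rng.permutation(len(D))[int(alpha * len(D)):]
    S_alpha = np.cov(D[keep], rowvar=False)      # covariance from (1-alpha)N examples
    a = quadratic_form_min(S, eps)               # minimizer under the full sample
    a_alpha = quadratic_form_min(S_alpha, eps)   # minimizer under the reduced sample
    q, q_alpha = a @ S @ a, a_alpha @ S @ a_alpha
    return alpha / (abs(q_alpha - q) / q_alpha)  # reject the model if R < 1
```

In a model selection loop one would compute R for each candidate kernel parameter and keep only candidates with R at least 1, picking among them the one with the largest Ω.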

It is not immediately clear how a strategy for model selection could be implemented with the approach of Lanckriet et al. (2002b). It is an open question for which practical applications either approach is preferable.

6. Experimental Results

In this section we present the results of the robust MPMR on toy problems and real world regression benchmarks. For all of the data sets we used a Gaussian kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2/\sigma)$ to obtain a nonlinear regression surface. We plot the behavior of Ω and R for a range of kernel parameters σ centered around $\sigma_0 = 0.3\,d$, where $d$ is the input dimension of the data set (the interested reader is referred to Schölkopf et al., 1997, and Schölkopf et al., 1999, for a discussion of this particular choice for σ); at each step, the parameter σ is increased by a constant factor. In all our experiments the same fixed fraction α of points was left out for the estimation of R. To quantify how accurate our lower bound estimate Ω is, we compute the fraction of test points that are predicted ±ε correctly ($\Omega_T$) for each test run. The common behavior on all data sets is that small values of σ (which yield bad models due to overfitting) have a tiny value of R, i.e. the corresponding model will be rejected. As σ increases, we obtain more robust estimates of the covariance matrix and therefore more robust MPMR models that will not be rejected. The point of interest is where the robustness measure R crosses 1, because this is the place where we become confident in our model and our Ω bound.

6.1 Toy Example

We used the sinc function $x \mapsto \sin(\pi x)/(\pi x)$ as a toy example for our learning algorithm. A noise term drawn from a zero mean Gaussian distribution was added to the examples. Training and test sets of equal size were used (see Figure 3a). We tried a range of values for σ and averaged over several test runs for each σ. For both a fixed Ω and a fixed ε we observed that Ω became a true lower bound on $\Omega_T$ right at the point where our confidence measure R crossed 1 (see Figures 2a,b). This toy experiment shows that the robustness ratio R is suitable for measuring the robustness of Ω estimates.

6.1.1 Comparison to Gaussian Processes

In this section we also use the sinc function as toy data but add a noise variable ρ that is not drawn from a (single) Gaussian distribution. Instead, we generate the noise from two separate Gaussians: with probability p = 0.5, ρ is drawn from a Gaussian with mean µ, and with probability p = 0.5 from a Gaussian with mean −µ (see Figure 3b). We tested a range of values for µ, again with training and test sets of equal size. The Gaussian process implementation we used is provided by Gibbs (1997b). We calculated the error bounds of the Gaussian process for one standard deviation; in the MPMR framework this corresponds to a value of Ω ≈ 0.68. By using (17) we computed the MPMR error bounds for each distribution.

[Figure 2: Toy example. (a) For Ω = const and R > 1 we have Ω < Ω_T. (b) For ε = const and R > 1 we have Ω < Ω_T.]

[Figure 3: (a) The sinc function with Gaussian noise: training data, the sinc function, and the MPMR model ±ε. (b) Density of the two-Gaussian noise for one value of µ.]

As expected, the Gaussian process error bounds ε_GP were tighter (i.e. more optimistic) than the worst case MPMR error bounds ε_MPMR (see Figure 4a). For small values of µ, where the noise distribution is still close to a (single) Gaussian, the Gaussian process error bounds are still accurate, i.e. they match the measured percentage of ±ε accurate predictions. As µ increases and the noise distribution becomes less Gaussian, the Gaussian process error bounds are no longer a good estimate of the prediction accuracy. The MPMR error bounds, however, are valid for all tested values of µ (see Figure 4b).

[Figure 4: Comparison of MPMR with Gaussian processes. (a) Error bounds ε_MPMR and ε_GP as a function of µ. (b) Fraction of ±ε accurate test points for MPMR and GP as a function of µ.]

6.2 Boston Housing

The first real world data set we experimented with is the Boston Housing data, which consists of N = 506 examples, each with d = 13 real valued inputs and one real valued output. We divided the data randomly into a training set of size N_train = 481 and a test set of size N_test = 25. We tried 75 different values for σ, and for each of these we averaged our results over several runs. For the first plot (Figure 5a) we fixed Ω = 0.5 and calculated the corresponding ε for the model. Additionally, we plotted the confidence measure R and the fraction Ω_T of test points within ±ε of our regression surface. As soon as R became bigger than 1, we noticed that Ω was a true lower bound, i.e. Ω < Ω_T. The second plot (Figure 5b) has a fixed ε, which was set to the value of ε where R became greater than 1 in the first plot. We recomputed Ω and Ω_T and saw as before that Ω < Ω_T for R > 1. It is also remarkable that the best regression model (i.e. the one that maximizes Ω_T for a given ε) occurs right at the point where R crosses 1. The next plot (Figure 5c) shows the percentage of bad models (where Ω > Ω_T), which is small for R > 1 and approaches zero as R increases. This shows that the more robust a model is in terms of R, the more unlikely it is for Ω not to be a true lower bound on the probability of ±ε accurate prediction. The fourth plot (Figure 5d) shows false positives and false negatives for the robustness measure itself. Note that we have a high false negative rate shortly before R crosses 1. The reason for this effect is that a model with, say, R = 0.8 will be rejected, but Ω < Ω_T may still hold, because Ω is a lower bound for all possible distributions, i.e. for specific distributions Ω is in general a pessimistic estimate. False positives are rare: when we accept a model because R > 1, it is very unlikely (< 5%) that the model will be inaccurate in the sense that Ω_T is smaller than Ω. Our last plot (Figure 5e) shows the connection of our regression approach to the more common one which minimizes the (normed) mean squared error. We found that the normed MSE (MSE_model / MSE_mean) has its minimum right at the point where R crosses 1 and Ω_T has its maximum in the second plot.

[Figure 5: Boston Housing results. (a) For R > 1 we have Ω_T > Ω, i.e. Ω is a lower bound. (b) For ε = const the maximal Ω_T occurs around R = 1. (c) For R > 1 the number of bad models is small. (d) Many false negatives (Ω is a worst case bound), few false positives. (e) The minimum normed MSE corresponds to the maximum Ω_T.]

6.3 Abalone

The Abalone data set contains N = 4177 examples, each with 8 inputs and one real valued output. One of the input features is not a real number but takes one of three distinct values. We split this discrete valued feature into 3 real valued features, setting one component to 1 and the other two to 0, which increases the input dimension to d = 10. The data was then divided randomly into N_train = 1000 training examples and N_test = 3177 test examples. As in the Boston experiment, we went over a range of kernel parameters and averaged the results over a number of test runs. Again, we plotted ε for a constant Ω, Ω for a constant ε, the fraction of bad models, false positives/negatives for R, and the normed mean squared error (see Figure 6). The overall behavior was the same as for the Boston Housing data: models with a confidence measure R > 1 yielded accurate lower bounds Ω. The false positive and false negative rates were more extreme for this data set: we did not get any false positives, but the false negative percentage was very high for R slightly less than 1. Also, models with a very insensitive kernel parameter tended to be about as good as models with a kernel parameter in the intermediate range. We ascribe this effect to the high noise level of the Abalone data set, which puts a bias towards insensitive models.

[Figure 6: Abalone results; panels (a)-(e) as in Figure 5. No false positives were observed.]

6.4 Kinematic Data

The KIN8NM data set consists of N = 8192 examples, each with 8 real valued inputs and one real valued output. We used the same setup as for the Abalone data set: N_train = 1000 training examples, a range of kernel parameters, and several test runs to average the results for each kernel parameter. As in the previous experiments, the robustness measure R provided a good estimate of the point where Ω becomes a lower bound on Ω_T. For this data set, however, the maximum Ω_T and the minimum MSE did not occur at the point where R crossed 1, but at a point where R was still less than one (see Figure 7).

7. Conclusion & Future Work

We have presented the MPMR framework as a new approach to regression problems in which models are represented as nonlinear (or linear) basis function expansions (see (2) and (3)). The objective of MPMR is to find a model that maximizes the worst case probability that future data is predicted accurately within some ±ε bound. We also introduced a general approach for reducing any regression problem to a binary classification problem by shifting the dependent variable both by +ε and by −ε. In this paper, we used MPMC as the underlying classifier, which gives us a distribution free probability bound Ω and therefore a regression algorithm that can estimate its own ability to make ±ε correct predictions. Remarkably, the resulting MPMR is equivalent to the method of least squares, a standard technique in statistical analysis. One consequence of our work is that it is now possible to calculate distribution free confidence intervals for least squares solutions (formerly this was only feasible under Gaussian or other specific, often unrealistic, distributional assumptions).

[Figure 7: Kinematic data results; panels (a)-(e) as in Figure 5.]

For nonlinear problems we used Mercer kernels as basis functions for our regression model; however, our formulation is not limited to Mercer kernels but works with arbitrary basis functions. Typically, there are kernel parameters (e.g. the width σ of the Gaussian kernel) that need to be adjusted properly to obtain good models. A common technique for finding these parameters is cross validation, which is computationally expensive. Our approach, in contrast, does not involve cross validation but is based on the robustness measure R. R is calculated by sampling parts of the training data and observing the effect on Ω, the objective of the MPMR. This is computationally more efficient, since N-fold cross validation requires building N models whereas we only need to build two. Experiments with real world data sets have shown that the robustness measure R works well for filtering out models that overfit the data and consequently violate the property of Ω as a true lower bound. Empirical evidence shows that R can be used to find optimal kernel parameters and is therefore suitable for model selection. Future research on MPMR will focus on investigating the possibility of doing model selection solely based on R, without any cross validation. As a consequence of Theorem 4 (Section 4.1), we are also looking into calculating lower bounds Ω for basis function regression models produced by other algorithms (e.g. the support vector machine regression described in Smola and Schölkopf, 1998). In combination with the robustness measure R (and possibly a model selection criterion) this idea could help improve the accuracy of a whole class of regression algorithms. Also, the current implementation requires solving an N×N linear system of equations (complexity O(N³)) and is therefore prohibitive for large N (where N is the number of training examples).

As a result we are currently working on a Sparse MPMR. Finally, we note that for N → ∞ MPMR converges to the true Ω (and the optimal model), since the estimates of the mean and covariance matrix become more and more accurate as the number of training examples N increases. We are currently looking into the rate of convergence of Ω as N increases, which should be directly related to the rate of convergence of the mean and covariance estimates. A Matlab implementation of the MPMR algorithm can be obtained from: strohman/papers.html

8. Acknowledgements

We are grateful to Mark Gibbs for providing an implementation for regression estimation with Gaussian processes.

9. Appendix A

The least squares problem can be formulated as the following linear system (see for example Wackerly et al., 2002):

$$X^{\top}X\boldsymbol{\beta} = X^{\top}Y \qquad (21)$$

where

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}$$

If we write out (21) we obtain (from now on all sums run from i = 1 to N):

$$\begin{pmatrix} N & \sum x_{i1} & \cdots & \sum x_{id} \\ \sum x_{i1} & \sum x_{i1}^2 & \cdots & \sum x_{i1}x_{id} \\ \vdots & \vdots & & \vdots \\ \sum x_{id} & \sum x_{id}x_{i1} & \cdots & \sum x_{id}^2 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_{i1}y_i \\ \vdots \\ \sum x_{id}y_i \end{pmatrix}$$

Now we write all the sums in terms of means and covariances:

$$\sum y_i = N\bar{y}, \qquad \sum x_{ij} = N\bar{x}_j, \qquad \sum x_{ij}y_i = (N-1)\sigma_{x_j y} + N\bar{x}_j\bar{y}, \qquad \sum x_{ij}x_{ik} = (N-1)\sigma_{x_j x_k} + N\bar{x}_j\bar{x}_k$$

for $j, k = 1, \dots, d$, where the last two identities are obtained by noting that:

$$\sigma_{x_j x_k} = \tfrac{1}{N-1}\sum (x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k) \iff (N-1)\,\sigma_{x_j x_k} = \sum \big(x_{ij}x_{ik} - x_{ij}\bar{x}_k - x_{ik}\bar{x}_j + \bar{x}_j\bar{x}_k\big) = \sum x_{ij}x_{ik} - N\bar{x}_j\bar{x}_k$$

If we plug these expressions into the linear system above we have:

$$\begin{pmatrix} N & N\bar{x}_1 & \cdots & N\bar{x}_d \\ N\bar{x}_1 & (N-1)\sigma_{x_1 x_1}+N\bar{x}_1\bar{x}_1 & \cdots & (N-1)\sigma_{x_1 x_d}+N\bar{x}_1\bar{x}_d \\ \vdots & \vdots & & \vdots \\ N\bar{x}_d & (N-1)\sigma_{x_d x_1}+N\bar{x}_d\bar{x}_1 & \cdots & (N-1)\sigma_{x_d x_d}+N\bar{x}_d\bar{x}_d \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix} = \begin{pmatrix} N\bar{y} \\ (N-1)\sigma_{x_1 y}+N\bar{x}_1\bar{y} \\ \vdots \\ (N-1)\sigma_{x_d y}+N\bar{x}_d\bar{y} \end{pmatrix}$$

If we look at the first equation we retrieve the familiar formula for $\beta_0$:

$$N\beta_0 + \sum_{j=1}^{d} N\bar{x}_j\beta_j = N\bar{y} \iff \beta_0 = \bar{y} - \sum_{j=1}^{d} \beta_j\bar{x}_j$$

Now look at the $(j+1)$-th equation ($j = 1, \dots, d$), which has $\bar{x}_j$ as the common factor:

$$N\bar{x}_j\beta_0 + \sum_{k=1}^{d} \beta_k\big[(N-1)\sigma_{x_j x_k} + N\bar{x}_j\bar{x}_k\big] = (N-1)\sigma_{x_j y} + N\bar{x}_j\bar{y}$$

Substituting $\beta_0 = \bar{y} - \sum_k\beta_k\bar{x}_k$ gives

$$N\bar{x}_j\Big(\bar{y} - \sum_{k=1}^{d}\beta_k\bar{x}_k\Big) + N\sum_{k=1}^{d}\beta_k\bar{x}_j\bar{x}_k + (N-1)\sum_{k=1}^{d}\beta_k\sigma_{x_j x_k} = (N-1)\sigma_{x_j y} + N\bar{x}_j\bar{y}$$

$$\iff N\bar{x}_j\bar{y} + (N-1)\sum_{k=1}^{d}\beta_k\sigma_{x_j x_k} = (N-1)\sigma_{x_j y} + N\bar{x}_j\bar{y} \iff \sum_{k=1}^{d}\beta_k\sigma_{x_j x_k} = \sigma_{x_j y}$$

The last expression matches exactly the $j$-th equation of (13). Since this holds for all $j = 1, \dots, d$, we have shown that (13) and (14) describe the same system that appears in least squares problems.
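As a quick numerical sanity check of Appendix A (a sketch under our notation; the data and names are invented for illustration), one can verify that the covariance form (13)-(14) and the normal equations (21) produce the same coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + rng.normal(size=N)

# normal equations (21): X^T X beta = X^T Y, with a column of ones for the offset
Xa = np.column_stack([np.ones(N), X])
beta_ls = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

# covariance form (13)-(14)
S = np.cov(np.column_stack([y, X]), rowvar=False)
beta = np.linalg.solve(S[1:, 1:], S[1:, 0])
beta0 = y.mean() - X.mean(axis=0) @ beta

print(np.allclose(beta_ls, np.concatenate([[beta0], beta])))   # True
```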

References

M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.

M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997a.

M. N. Gibbs. Tpros software: Regression using Gaussian processes, 1997b.

G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. Minimax probability machine. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002a. MIT Press.

G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555-582, 2002b.

D. MacKay. Gaussian processes: A replacement for supervised neural networks?, 1997. Lecture notes for a tutorial at NIPS.

A. W. Marshall and I. Olkin. Multivariate Chebyshev inequalities. Annals of Mathematical Statistics, 31(4):1001-1014, 1960.

I. Popescu and D. Bertsimas. Optimal inequalities in probability theory: A convex optimization approach. Technical Report TM62, INSEAD, Dept. Math. O.R., Cambridge, Mass., 2001.

C. E. Rasmussen. Evaluation of Gaussian Processes and other Methods for Non-Linear Regression. PhD thesis, University of Toronto, 1996.

S. J. Raudys and A. K. Jain. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3):252-264, 1991.

B. Schölkopf, P. Bartlett, A. Smola, and R. Williamson. Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, Cambridge, MA, 1999. MIT Press.

B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

B. Schölkopf, K.-K. Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758-2765, 1997.

A. Smola and B. Schölkopf. A tutorial on support vector regression. Technical Report NC2-TR-1998-030, Royal Holloway College, 1998.

T. R. Strohmann and G. Z. Grudic. A formulation for minimax probability machine regression. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.

M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. MIT Press.

D. Wackerly, W. Mendenhall, and R. Scheaffer. Mathematical Statistics with Applications, Sixth Edition. Duxbury Press, Pacific Grove, CA, 2002.

C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, Cambridge, MA, 1995. MIT Press.


More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Review: Support vector machines. Machine learning techniques and image analysis

Review: Support vector machines. Machine learning techniques and image analysis Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization

More information

Support Vector Machine & Its Applications

Support Vector Machine & Its Applications Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Support Vector Machine (SVM) Hamid R. Rabiee Hadi Asheri, Jafar Muhammadi, Nima Pourdamghani Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Introduction

More information

A Tutorial on Support Vector Machine

A Tutorial on Support Vector Machine A Tutorial on School of Computing National University of Singapore Contents Theory on Using with Other s Contents Transforming Theory on Using with Other s What is a classifier? A function that maps instances

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Robust Novelty Detection with Single Class MPM

Robust Novelty Detection with Single Class MPM Robust Novelty Detection with Single Class MPM Gert R.G. Lanckriet EECS, U.C. Berkeley gert@eecs.berkeley.edu Laurent El Ghaoui EECS, U.C. Berkeley elghaoui@eecs.berkeley.edu Michael I. Jordan Computer

More information

CS798: Selected topics in Machine Learning

CS798: Selected topics in Machine Learning CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Ammon Washburn University of Arizona September 25, 2015 1 / 28 Introduction We will begin with basic Support Vector Machines (SVMs)

More information

SMO Algorithms for Support Vector Machines without Bias Term

SMO Algorithms for Support Vector Machines without Bias Term Institute of Automatic Control Laboratory for Control Systems and Process Automation Prof. Dr.-Ing. Dr. h. c. Rolf Isermann SMO Algorithms for Support Vector Machines without Bias Term Michael Vogt, 18-Jul-2002

More information

SUPPORT VECTOR MACHINE FOR THE SIMULTANEOUS APPROXIMATION OF A FUNCTION AND ITS DERIVATIVE

SUPPORT VECTOR MACHINE FOR THE SIMULTANEOUS APPROXIMATION OF A FUNCTION AND ITS DERIVATIVE SUPPORT VECTOR MACHINE FOR THE SIMULTANEOUS APPROXIMATION OF A FUNCTION AND ITS DERIVATIVE M. Lázaro 1, I. Santamaría 2, F. Pérez-Cruz 1, A. Artés-Rodríguez 1 1 Departamento de Teoría de la Señal y Comunicaciones

More information

Lecture 10: A brief introduction to Support Vector Machine

Lecture 10: A brief introduction to Support Vector Machine Lecture 10: A brief introduction to Support Vector Machine Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Semi-Supervised Learning through Principal Directions Estimation

Semi-Supervised Learning through Principal Directions Estimation Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de

More information

A Robust Minimax Approach to Classification

A Robust Minimax Approach to Classification A Robust Minimax Approach to Classification Gert R.G. Lanckriet Laurent El Ghaoui Chiranjib Bhattacharyya Michael I. Jordan Department of Electrical Engineering and Computer Science and Department of Statistics

More information

Introduction to SVM and RVM

Introduction to SVM and RVM Introduction to SVM and RVM Machine Learning Seminar HUS HVL UIB Yushu Li, UIB Overview Support vector machine SVM First introduced by Vapnik, et al. 1992 Several literature and wide applications Relevance

More information

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning. Lecture 6: Support Vector Machine. Feng Li. Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Robust Fisher Discriminant Analysis

Robust Fisher Discriminant Analysis Robust Fisher Discriminant Analysis Seung-Jean Kim Alessandro Magnani Stephen P. Boyd Information Systems Laboratory Electrical Engineering Department, Stanford University Stanford, CA 94305-9510 sjkim@stanford.edu

More information

Linear Classification and SVM. Dr. Xin Zhang

Linear Classification and SVM. Dr. Xin Zhang Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lessons 6 10 Jan 2017 Outline Perceptrons and Support Vector machines Notation... 2 Perceptrons... 3 History...3

More information

Analysis of Multiclass Support Vector Machines

Analysis of Multiclass Support Vector Machines Analysis of Multiclass Support Vector Machines Shigeo Abe Graduate School of Science and Technology Kobe University Kobe, Japan abe@eedept.kobe-u.ac.jp Abstract Since support vector machines for pattern

More information

Support Vector Machines

Support Vector Machines EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Support Vector Machines and Kernel Algorithms

Support Vector Machines and Kernel Algorithms Support Vector Machines and Kernel Algorithms Bernhard Schölkopf Max-Planck-Institut für biologische Kybernetik 72076 Tübingen, Germany Bernhard.Schoelkopf@tuebingen.mpg.de Alex Smola RSISE, Australian

More information

Support Vector Machines II. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Support Vector Machines II. CAP 5610: Machine Learning Instructor: Guo-Jun QI Support Vector Machines II CAP 5610: Machine Learning Instructor: Guo-Jun QI 1 Outline Linear SVM hard margin Linear SVM soft margin Non-linear SVM Application Linear Support Vector Machine An optimization

More information

Nonlinear Dimensionality Reduction

Nonlinear Dimensionality Reduction Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Kernel PCA 2 Isomap 3 Locally Linear Embedding 4 Laplacian Eigenmap

More information

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector

More information

below, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing

below, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing Kernel PCA Pattern Reconstruction via Approximate Pre-Images Bernhard Scholkopf, Sebastian Mika, Alex Smola, Gunnar Ratsch, & Klaus-Robert Muller GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany fbs,

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers Computational Methods for Data Analysis Massimo Poesio SUPPORT VECTOR MACHINES Support Vector Machines Linear classifiers 1 Linear Classifiers denotes +1 denotes -1 w x + b>0 f(x,w,b) = sign(w x + b) How

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

Statistical Methods for SVM

Statistical Methods for SVM Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION International Journal of Pure and Applied Mathematics Volume 87 No. 6 2013, 741-750 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v87i6.2

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

A Robust Minimax Approach to Classification

A Robust Minimax Approach to Classification A Robust Minimax Approach to Classification Gert R.G. Lanckriet gert@cs.berkeley.edu Department of Electrical Engineering and Computer Science University of California, Berkeley, CA 94720, USA Laurent

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 68 November 999 C.B.C.L

More information

Support Vector Machine Regression for Volatile Stock Market Prediction

Support Vector Machine Regression for Volatile Stock Market Prediction Support Vector Machine Regression for Volatile Stock Market Prediction Haiqin Yang, Laiwan Chan, and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machines and Kernel Methods Geoff Gordon ggordon@cs.cmu.edu July 10, 2003 Overview Why do people care about SVMs? Classification problems SVMs often produce good results over a wide range

More information