Robust Minimax Probability Machine Regression

Thomas R. Strohmann (strohman@cs.colorado.edu) and Gregory Z. Grudic (grudic@cs.colorado.edu)
Department of Computer Science, University of Colorado, Boulder, CO, USA

Abstract

We formulate regression as maximizing the minimum probability (Ω) that the true regression function is within ±ε of the regression model. Our framework starts by posing regression as a binary classification problem, such that a solution to this single classification problem directly solves the original regression problem. Minimax probability machine classification (Lanckriet et al., 2002a) is used to solve the binary classification problem, resulting in a direct bound on the minimum probability Ω that the true regression function is within ±ε of the regression model. This minimax probability machine regression (MPMR) model assumes only that the mean and covariance matrix of the distribution that generated the regression data are known; no further assumptions on conditional distributions are required. Theory is formulated for determining when estimates of mean and covariance are accurate, thus implying robust estimates of the probability bound Ω. Conditions under which the MPMR regression surface is identical to a standard least squares regression surface are given, allowing direct generalization bounds to be easily calculated for any least squares regression model (which was previously possible only under very specific, often unrealistic, distributional assumptions). We further generalize these theoretical bounds to any superposition of basis functions regression model. Experimental evidence is given supporting these theoretical results.

Keywords: Minimax Probability Machine, Regression, Robust, Kernel Methods, Bounds, Distribution Free

1. Introduction

In this paper we formulate the regression problem as one of maximizing the probability, denoted by Ω, that the predicted output will be within some margin ε of the true regression function. (Throughout this paper, the phenomenon that generated the training data is referred to as the true regression function or surface, while the model that is constructed from the training data is referred to as the regression model.) We show how to compute a direct estimate of this probability for any given margin ε > 0. Following the idea of the minimax probability machine for classification (MPMC) due to Lanckriet et al. (2002a), we make no detailed distributional assumptions, but obtain a probability Ω that is a lower bound for all possible distributions with a known mean and covariance matrix. In other words, we maximize the minimum probability of our regression model being ±ε correct on test data for all possible distributions that

have the same mean and covariance matrix as the distribution that generated the training data. We term this type of regression model a minimax probability machine regression (MPMR) model (see Strohmann and Grudic, 2003).

Current practice for estimating how good a regression model is dictates that one either estimate the underlying distribution of the data or make Gaussian assumptions, both of which have their problems. Estimating high dimensional distributions from small samples is a very difficult problem that has not been solved satisfactorily (see Raudys and Jain, 1991). Gaussian processes obtain regression models within a Bayesian framework (see Williams and Rasmussen, 1995; Rasmussen, 1996; Gibbs, 1997a; MacKay, 1997). As the name indicates, they put a prior on the function space that is a generalization of a Gaussian distribution over (a finite number of) random variables. Instead of a mean and covariance, Gaussian processes use a mean function and a covariance function to express priors on the function space. By adopting these restrictions it is possible to obtain confidence intervals for a given regression estimate. In practice, however, the assumption that data is generated from underlying Gaussians does not always hold, which can make the Gaussian process confidence intervals unrealistic. Similar problems with unrealistic bounds occur in algorithms such as the relevance vector machine (Tipping, 2000), where potentially unrealistic approximations are made in obtaining them. In contrast, as long as one can obtain accurate estimates of the mean and covariance matrix, our distribution free algorithm will always yield a correct lower bound estimate, no matter what the underlying distribution is.

Because the accuracy of the proposed regression algorithm directly depends on the accuracy of the mean and covariance estimates, we propose a theoretical framework for determining when these estimates are accurate. This in turn determines when the estimate of the bound Ω is accurate. The general idea is to randomly discard a (small) subset of the training data and measure how sensitive the estimate of Ω is when we use the remaining data to estimate the mean and covariance matrix. This approach has some resemblance to the RANSAC (RANdom SAmple Consensus) method, originally developed by Fischler and Bolles (1981) for the task of image analysis.

To solve the regression problem, we reduce it to a binary classification problem by generating two classes that are obtained by shifting the dependent variable by ±ε (see Figure 1). The regression surface is then interpreted as the boundary that separates the two classes. Although any classifier can be used to solve the regression problem in this way, we propose to use MPMC (Lanckriet et al., 2002a), which gives a direct bound on Ω. One result of this paper is a theorem stating that when MPMC is used to solve the regression problem as described above, the regression model is identical to the one obtained by standard least squares regression. The importance of this result is that it is now easy to calculate a generalization bound for the least squares method. This was previously only possible under strict, often unrealistic, distributional assumptions.

The paper is organized as follows: In Section 2 we give a strict mathematical definition of our approach to the regression problem and describe our hypothesis space of functions for linear and nonlinear regression surfaces.
Section 3 outlines the statistical foundation of the minimax probability machine, based on a theorem due to Marshall and Olkin (1960) that was later extended by Popescu and Bertsimas (2001). We then show how any regression problem can be formulated as a binary classification problem by creating two symmetric classes: one has the output variable shifted by +ε, the other has it shifted by −ε (see Figure 1).

[Figure 1: Formulating regression as binary classification. (a) The regression problem. (b) The artificial classification problem: classes u and v are obtained by shifting the outputs by ±ε, and the classification boundary is the regression surface.]

Following this scheme we conclude the section by deriving the linear MPMR from the linear MPMC and demonstrating that, due to the symmetry, it is computationally easier to find an MPMR model than to find an MPMC classifier. In Section 4 we describe how to obtain nonlinear regression surfaces by using any set of nonlinear basis functions, and then focus on using nonlinear kernel maps as a distance metric on pairs of input vectors. The problem of overfitting the training data with highly sensitive nonlinear kernel maps is addressed in Section 5. There we propose a robustness measure that reflects how sensitive our estimate of Ω is with respect to the estimates of mean and covariance (which are the only estimates MPMR relies upon). Section 6 contains the experiments we conducted with our learning algorithm on toy problems and real world data sets. We start out with a simple toy example and show that for certain distributions the Gaussian process error bounds fail while the ones obtained by MPMR still hold. For the real world data sets we focused on investigating the accuracy of Ω as a lower bound probability on ±ε-accurate predictions and its relation to our robustness measure. The last section summarizes the main contributions of this paper and points out some open research questions concerning the MPMR framework. A Matlab implementation of the MPMR algorithm can be obtained from: strohman/papers.html

2. Regression Model

Formally, MPMR addresses the following problem: Let $f: \mathbb{R}^d \to \mathbb{R}$ be some unknown regression function (i.e. the phenomenon that generated the learning data), let the random vector $\mathbf{x} \in \mathbb{R}^d$ be generated from some bounded distribution that has mean $\bar{\mathbf{x}}$ and covariance $\Sigma_x$ (which we will write from now on as $\mathbf{x} \sim (\bar{\mathbf{x}}, \Sigma_x)$), and let our training data be generated according to:

$$y = f(\mathbf{x}) + \rho$$

where ρ is a noise term with expected value $E[\rho] = \bar{\rho} = 0$ and some finite variance $\mathrm{Var}[\rho] = \sigma_\rho^2$. Given a hypothesis space $H$ of functions from $\mathbb{R}^d$ to $\mathbb{R}$, we want to find a model $\hat{f} \in H$ that maximizes the minimum probability of being ±ε accurate. We define:

$$\Omega_f = \inf_{(\mathbf{x},y)\sim(\bar{\mathbf{x}},\bar{y},\Sigma)} \Pr\{|f(\mathbf{x}) - y| < \epsilon\} \qquad (1)$$

where $\Sigma = \mathrm{cov}(y, x_1, \dots, x_d) \in \mathbb{R}^{(d+1)\times(d+1)}$ is the complete covariance matrix of the random variable $y = f(\mathbf{x}) + \rho$ and the random vector $\mathbf{x}$ (see equation (12) for details). The only assumptions we make about the first and second moments of the underlying distribution are that $\bar{\mathbf{x}}$, $\bar{y}$ and $\Sigma$ are finite. No other distributional assumptions are made. For any function $f \in H$, the model $\hat{f}$ that we are looking for has to satisfy:

$$\Omega_{\hat{f}} \ge \Omega_f \quad \text{for all } f \in H$$

For linear models, $H$ contains all functions that are linear combinations of the input vector:

$$f(\mathbf{x}) = \sum_{j=1}^{d} \beta_j x_j + \beta_0 = \boldsymbol{\beta}^{\top}\mathbf{x} + \beta_0 \qquad (\text{where } \boldsymbol{\beta} \in \mathbb{R}^d,\ \beta_0 \in \mathbb{R}) \qquad (2)$$

For nonlinear models we consider the general basis function formulation:

$$f(\mathbf{x}) = \sum_{i=1}^{k} \beta_i \Phi_i(\mathbf{x}) + \beta_0 \qquad (3)$$

In this paper we focus on the commonly used kernel representation $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ (see e.g. Smola and Schölkopf, 1998), where the hypothesis space consists of all linear combinations of kernel functions with the training inputs as their first arguments (i.e. $\Phi_i(\mathbf{x}) = K(\mathbf{x}_i, \mathbf{x})$):

$$f(\mathbf{x}) = \sum_{i=1}^{N} \beta_i K(\mathbf{x}_i, \mathbf{x}) + \beta_0$$

where $\mathbf{x}_i \in \mathbb{R}^d$ denotes the $i$-th input vector of the training set $\Gamma = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$. In the nonlinear formulation we consider the distribution of the random vector of basis functions $\Phi = (\Phi_1, \Phi_2, \dots, \Phi_N)$ to formulate the MPMR bound Ω:

$$\Omega_f = \inf_{(\Phi,y)\sim(\bar{\Phi},\bar{y},\Sigma)} \Pr\{|f(\mathbf{x}) - y| < \epsilon\}$$

where $\Sigma = \mathrm{cov}(y, \Phi_1, \Phi_2, \dots, \Phi_N)$; see equation (12) for details on calculating Σ.

3. Linear Framework

In this section we develop the linear MPMR by reducing the regression problem to a single (specially structured) binary classification problem. We also show that the regression estimator of the linear MPMR is equivalent to the one obtained by the method of least squares. The terms random vector and class will be used interchangeably, in the sense that all examples of a class are assumed to be generated from its associated random vector.

3.1 Minimax Probability Machine Classification (MPMC)

The minimax binary classifier for linear decision boundaries can be formulated as a hyperplane that maximizes the minimum probability of correctly classifying data points generated from the two underlying random vectors (Lanckriet et al., 2002a,b). Assume we have random vectors $\mathbf{u}$ and $\mathbf{v}$ that are drawn from two probability distributions characterized by $(\bar{u}, \Sigma_u)$ and $(\bar{v}, \Sigma_v)$; the linear MPMC algorithm solves the following optimization problem:

$$\max_{\Omega,\,a \ne 0,\,b} \Omega \quad \text{s.t.} \quad \inf_{\mathbf{u}\sim(\bar{u},\Sigma_u)} \Pr\{a^{\top}\mathbf{u} \ge b\} \ge \Omega, \qquad \inf_{\mathbf{v}\sim(\bar{v},\Sigma_v)} \Pr\{a^{\top}\mathbf{v} \le b\} \ge \Omega \qquad (4)$$

A theorem due to Marshall and Olkin (1960), later extended by Popescu and Bertsimas (2001), gives a formula for the supremum probability that a random vector lies on one side of a hyperplane:

$$\sup_{\mathbf{v}\sim(\bar{v},\Sigma_v)} \Pr\{a^{\top}\mathbf{v} \ge b\} = \frac{1}{1+\delta^2}, \quad \text{with} \quad \delta^2 = \inf_{a^{\top}\mathbf{w} \ge b} (\mathbf{w}-\bar{v})^{\top}\Sigma_v^{-1}(\mathbf{w}-\bar{v})$$

where the infimum can be computed analytically from the hyperplane parameters $a$ and $b$ and the estimates $\bar{v}$, $\Sigma_v$ (Lanckriet et al., 2002b). This result can be used to simplify the optimization problem (4) to the following one:

$$m = \min_a \Big(\sqrt{a^{\top}\Sigma_u a} + \sqrt{a^{\top}\Sigma_v a}\Big) \quad \text{s.t.} \quad a^{\top}(\bar{u}-\bar{v}) = 1 \qquad (5)$$

The offset $b$ of the hyperplane is uniquely determined for any optimal $a$:

$$b = a^{\top}\bar{u} - \frac{\sqrt{a^{\top}\Sigma_u a}}{m} = a^{\top}\bar{v} + \frac{\sqrt{a^{\top}\Sigma_v a}}{m} \qquad (6)$$

Once the minimum value $m$ is obtained for (5), the probability bound Ω can be computed as:

$$\Omega = \frac{1}{m^2 + 1}$$

3.2 Formulating Regression as Classification

In a linear regression model we want to approximate $f: \mathbb{R}^d \to \mathbb{R}$ with a linear estimator, i.e. a function of the form $\hat{y} = \hat{f}(\mathbf{x}) = \boldsymbol{\beta}^{\top}\mathbf{x} + \beta_0$. To obtain a minimax regression model, i.e. one which maximizes the minimum probability that future points are predicted ±ε correctly, we use the MPMC framework in the following way.

Each training data point $(\mathbf{x}_i, y_i)$, for $i = 1, \dots, N$, is turned into two $(d+1)$-dimensional vectors; one is labelled as class $u$ (and has the $y$ value shifted by $+\epsilon$), the other as class $v$ (and has the $y$ value shifted by $-\epsilon$; see also Figure 1):

$$u_i = (y_i + \epsilon, x_{i1}, x_{i2}, \dots, x_{id}), \qquad v_i = (y_i - \epsilon, x_{i1}, x_{i2}, \dots, x_{id}), \qquad i = 1, \dots, N \qquad (7)$$

It is interesting to note that we could use any binary classification algorithm to solve this artificial classification problem. The boundary obtained by the classifier turns directly into the regression surface one wants to estimate. For example, one could even use decision trees as the underlying classifier, determine where the decision tree changes its classification, and use this boundary as the regression output. In this paper we focus on the consequences of using MPMC as the underlying classifier for the problem defined by (7). Once the hyperplane parameters $a$ and $b$ have been determined by the MPMC, we use the classification boundary $a^{\top}z = b$ to predict the output $\hat{y}$ for a new input $\mathbf{x}$ (where $z$ is defined as $z = (\hat{y}, x_1, x_2, \dots, x_d)$):

$$\begin{aligned} & a^{\top}z = b \quad \text{(classification boundary)}\\ \iff\;& a_1\hat{y} + a_2 x_1 + a_3 x_2 + \dots + a_{d+1} x_d = b\\ \iff\;& \hat{y} = -\tfrac{a_2}{a_1}x_1 - \tfrac{a_3}{a_1}x_2 - \dots - \tfrac{a_{d+1}}{a_1}x_d + \tfrac{b}{a_1} \qquad (a_1 \ne 0)\\ \iff\;& \hat{y} = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_d x_d + \beta_0 = \boldsymbol{\beta}^{\top}\mathbf{x} + \beta_0 \quad \text{(regression function)} \end{aligned} \qquad (8)$$

The symmetric ±ε shift in the output direction has a number of useful properties. First of all, the probability bound Ω of correct classification is equivalent to the probability that future predictions will be ±ε accurate. Roughly speaking, if the hyperplane classifies a point of class $u$ correctly, then the corresponding $y$ value can be at most ε below the hyperplane. Similarly, for points of class $v$ that are classified correctly, the corresponding $y$ value can be at most ε above the hyperplane (see the next section for a rigorous proof). Second, the artificial classification problem has a much simpler structure than the general classification problem. In particular, we find that $\Sigma_u = \Sigma_v =: \Sigma$ and $\bar{u} - \bar{v} = (2\epsilon, 0, \dots, 0)^{\top}$ (this symmetry is illustrated in the sketch following (9)). This allows us to simplify (5) even further: we can minimize $2\sqrt{a^{\top}\Sigma a}$, or equivalently $a^{\top}\Sigma a$, with respect to $a$. The next section shows how to solve this minimization problem by solving just one linear system.

3.3 Linear MPMR

Given an estimate $\Sigma = \Sigma_u = \Sigma_v \in \mathbb{R}^{(d+1)\times(d+1)}$ of the covariance matrix for the random vector $(y, \mathbf{x})$ that generated the data, we can simplify (5):

$$\min_a \sqrt{a^{\top}\Sigma_u a} + \sqrt{a^{\top}\Sigma_v a}\;\; \text{s.t.}\;\; a^{\top}(\bar{u}-\bar{v}) = 1 \quad\Longleftrightarrow\quad \min_a 2\sqrt{a^{\top}\Sigma a}\;\; \text{s.t.}\;\; a^{\top}(2\epsilon, 0, \dots, 0)^{\top} = 1$$

Since we are only interested in the minimum, we can get rid of the square root and the factor of 2. The goal of the linear MPMR can then be formulated as the solution of the following constrained optimization problem for $a \in \mathbb{R}^{d+1}$:

$$\min_a a^{\top}\Sigma a \quad \text{s.t.} \quad (2\epsilon, 0, \dots, 0)\,a = 1 \qquad (9)$$
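Before turning to the solution of (9), the following minimal sketch (Python with NumPy; the variable names and the synthetic data are ours and purely illustrative, not the paper's Matlab implementation) builds the two artificial classes of (7) from a toy training set and checks numerically that $\Sigma_u = \Sigma_v$ and $\bar{u} - \bar{v} = (2\epsilon, 0, \dots, 0)^{\top}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, eps = 200, 3, 0.5

# toy training data: y = f(x) + noise
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

# construction (7): shift the output by +eps for class u and by -eps for class v
U = np.column_stack([y + eps, X])   # class u, shape (N, d+1)
V = np.column_stack([y - eps, X])   # class v, shape (N, d+1)

Sigma_u = np.cov(U, rowvar=False)   # sample covariance of class u
Sigma_v = np.cov(V, rowvar=False)   # sample covariance of class v

# the shift leaves the covariances identical and only moves the means apart
print(np.allclose(Sigma_u, Sigma_v))       # True
print(U.mean(axis=0) - V.mean(axis=0))     # approximately (2*eps, 0, ..., 0)
```

Because only the first component of the class means differs, the geometry of the artificial classification problem is controlled by the single covariance matrix Σ, which is what makes the simplification leading to (9) possible.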

We use Lagrange multipliers to turn the problem into an unconstrained optimization problem:

$$L(a, \lambda) = a^{\top}\Sigma a + \lambda\big((2\epsilon, 0, \dots, 0)\,a - 1\big)$$

By setting the derivatives $\partial L(a,\lambda)/\partial a = 0$ and $\partial L(a,\lambda)/\partial \lambda = 0$ we obtain the following system of linear equations, which describes a solution of (9):

$$2\Sigma a + (2\lambda\epsilon, 0, \dots, 0)^{\top} = \mathbf{0} \qquad (10)$$

$$\text{and} \qquad a_1 = \tfrac{1}{2\epsilon} \qquad (11)$$

where

$$\Sigma = \begin{pmatrix} \sigma_{yy} & \sigma_{y x_1} & \cdots & \sigma_{y x_d} \\ \sigma_{x_1 y} & \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \vdots & \vdots & & \vdots \\ \sigma_{x_d y} & \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix} \qquad (12)$$

is the full sample covariance matrix for the training data $\Gamma = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$. The entries of Σ are estimated as follows:

$$\bar{x}_j = \tfrac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad \bar{y} = \tfrac{1}{N}\sum_{i=1}^{N} y_i, \qquad \sigma_{yy} = \tfrac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2$$

$$\sigma_{x_j y} = \tfrac{1}{N-1}\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(y_i - \bar{y}) = \sigma_{y x_j}, \qquad \sigma_{x_j x_k} = \tfrac{1}{N-1}\sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k), \qquad j, k = 1, \dots, d$$

The system (10)-(11) has $d+2$ equations and $d+2$ unknowns: $a_1, a_2, \dots, a_{d+1}$ and the Lagrange multiplier λ (note also that (11) guarantees the requirement $a_1 \ne 0$ of (8) for an arbitrary ε > 0). For any values $a_1, a_2, \dots, a_{d+1}$, the first equation of (10) can always be satisfied by setting the free variable λ appropriately: $\lambda = -(\Sigma a)_1/\epsilon$. We write out the other $d$ equations of (10) and omit the factor of 2:

$$\begin{pmatrix} \sigma_{x_1 y} & \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \sigma_{x_2 y} & \sigma_{x_2 x_1} & \cdots & \sigma_{x_2 x_d} \\ \vdots & \vdots & & \vdots \\ \sigma_{x_d y} & \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix} a = \mathbf{0}$$

Recall that we are interested in the coefficients $\beta_j$, which can be written as $\beta_j = -a_{j+1}/a_1$ for $j = 1, \dots, d$ (see (8)). Therefore we rearrange the system above in terms of the $\beta_j$ (by multiplying out the $\sigma_{x_j y}\,a_1$ terms, bringing them to the other side, and dividing by $-a_1$):

$$a_1 \begin{pmatrix} \sigma_{x_1 y} \\ \sigma_{x_2 y} \\ \vdots \\ \sigma_{x_d y} \end{pmatrix} + \begin{pmatrix} \sigma_{x_1 x_1} & \sigma_{x_1 x_2} & \cdots & \sigma_{x_1 x_d} \\ \vdots & \vdots & & \vdots \\ \sigma_{x_d x_1} & \sigma_{x_d x_2} & \cdots & \sigma_{x_d x_d} \end{pmatrix} \begin{pmatrix} a_2 \\ a_3 \\ \vdots \\ a_{d+1} \end{pmatrix} = \mathbf{0}$$

becomes

$$\begin{pmatrix} \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \vdots & & \vdots \\ \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix} \begin{pmatrix} a_2 \\ a_3 \\ \vdots \\ a_{d+1} \end{pmatrix} = -a_1 \begin{pmatrix} \sigma_{x_1 y} \\ \sigma_{x_2 y} \\ \vdots \\ \sigma_{x_d y} \end{pmatrix}$$

and finally (divide by $-a_1$ and substitute $\beta_j = -a_{j+1}/a_1$ for $j = 1, \dots, d$)

$$\begin{pmatrix} \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \vdots & & \vdots \\ \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_d \end{pmatrix} = \begin{pmatrix} \sigma_{x_1 y} \\ \sigma_{x_2 y} \\ \vdots \\ \sigma_{x_d y} \end{pmatrix} \qquad (13)$$

One can show (see Appendix A) that this is exactly the same system that appears when solving linear least squares problems. Consequently, the offset $\beta_0$ should also be equivalent to the offset $\beta_0 = \bar{y} - \boldsymbol{\beta}^{\top}\bar{\mathbf{x}}$ of the least squares solution. To see why this is true, we start by looking at the offset $b$ of the hyperplane. From (6), with $\Sigma_u = \Sigma_v = \Sigma$:

$$b = a^{\top}\bar{u} - \frac{\sqrt{a^{\top}\Sigma a}}{\sqrt{a^{\top}\Sigma a}+\sqrt{a^{\top}\Sigma a}} = \Big(\tfrac{1}{2\epsilon}, -\tfrac{\beta_1}{2\epsilon}, \dots, -\tfrac{\beta_d}{2\epsilon}\Big)(\bar{y}+\epsilon, \bar{x}_1, \dots, \bar{x}_d)^{\top} - \tfrac{1}{2} = \tfrac{1}{2\epsilon}\Big(\bar{y} + \epsilon - \sum_{i=1}^{d}\beta_i\bar{x}_i\Big) - \tfrac{1}{2} = \tfrac{1}{2\epsilon}\Big(\bar{y} - \sum_{i=1}^{d}\beta_i\bar{x}_i\Big)$$

Therefore, using $a_1 = 1/(2\epsilon)$, we have:

$$\beta_0 = \frac{b}{a_1} = 2\epsilon\, b = \bar{y} - \sum_{j=1}^{d}\beta_j\bar{x}_j = \bar{y} - \boldsymbol{\beta}^{\top}\bar{\mathbf{x}} \qquad (14)$$

From (13) and (14) we obtain our first theorem:

Theorem 1. The linear MPMR algorithm computes the same parameters $\beta_0, \beta_1, \beta_2, \dots, \beta_d$ as the method of least squares.

Proof. Derivation of (13) and (14).

We have shown that for the linear case the regression estimator obtained from MPMR is the same one gets from least squares. The MPMR approach, however, also allows us to compute a distribution free lower bound Ω on the probability that future data points are predicted ±ε correctly. We do not know of any other formulation that yields such a bound for the commonly used least squares regression. To actually compute the bound Ω, we estimate the covariance matrix Σ from the training data and solve (13) for $\boldsymbol{\beta}$. Then, for any given margin ε > 0 we set $a_1 = 1/(2\epsilon)$, $a_{j+1} = -\beta_j a_1$ ($j = 1, \dots, d$) and calculate:

$$\Omega = \frac{1}{4\,a^{\top}\Sigma a + 1} \qquad (15)$$

The following theorem establishes that the Ω calculated in (15) is a lower bound on the probability of future predictions being ±ε accurate (if we have perfect estimates of $\bar{\mathbf{x}}$, $\bar{y}$, Σ).

Theorem 2. For any distributions $\Lambda_x$ with $E[\Lambda_x] = \bar{\mathbf{x}}$, $V[\Lambda_x] = \Sigma_x$ and $\Lambda_\rho$ with $E[\Lambda_\rho] = 0$, $V[\Lambda_\rho] = \sigma_\rho^2$ (where $\bar{\mathbf{x}}$, $\Sigma_x$, $\sigma_\rho^2$ are finite), and $\mathbf{x} = (x_1, \dots, x_d) \sim \Lambda_x$, $\rho \sim \Lambda_\rho$, let $y = f(\mathbf{x}) + \rho$ be the (noisy) output of the true regression function. Assume perfect knowledge of the statistics $\bar{\mathbf{x}}$, $\bar{y}$, Σ (where Σ is defined in (12)). Then the MPMR model $\hat{f}: \mathbb{R}^d \to \mathbb{R}$ satisfies:

$$\inf_{(\mathbf{x},y)\sim(\bar{\mathbf{x}},\bar{y},\Sigma)} \Pr\{|\hat{f}(\mathbf{x}) - y| < \epsilon\} = \Omega$$

Proof. For any $(d+1)$-dimensional vector $z = (y, x_1, \dots, x_d)$ we have:

$z$ is classified as class $u$ $\iff$ $y > \hat{f}(\mathbf{x})$, and $z$ is classified as class $v$ $\iff$ $y < \hat{f}(\mathbf{x})$

(this follows directly from the construction of the $(d+1)$-dimensional hyperplane). Now look at the random variables $z_{+\epsilon} = (y + \epsilon, \mathbf{x})$ and $z_{-\epsilon} = (y - \epsilon, \mathbf{x})$. The distribution of $z_{+\epsilon}$ satisfies $z_{+\epsilon} \sim ((\bar{y} + \epsilon, \bar{\mathbf{x}}), \Sigma)$, which has the same mean and covariance matrix as our (artificial) class $u \sim (\bar{u}, \Sigma_u)$. Similarly, we find that $z_{-\epsilon} \sim (\bar{v}, \Sigma_v)$. The MPMC classifier guarantees that Ω is a lower bound on correct classification. Formally, correct classification is the probabilistic event that $z_{+\epsilon}$ is classified as $u$ and that at the same time $z_{-\epsilon}$ is classified as $v$. If we use the two equivalences above we obtain:

$$\Pr\{z_{+\epsilon} \text{ classified as } u \wedge z_{-\epsilon} \text{ classified as } v\} = \Pr\{y + \epsilon > \hat{f}(\mathbf{x}) \wedge y - \epsilon < \hat{f}(\mathbf{x})\} = \Pr\{|\hat{f}(\mathbf{x}) - y| < \epsilon\}$$

Since Ω is the infimum probability over all possible distributions for $z_{+\epsilon}$ and $z_{-\epsilon}$, it is also the infimum probability over the equivalent distributions for $\mathbf{x}$ and $y$. This completes the proof that Ω is a lower bound probability for the MPMR model being ±ε accurate.

It is interesting to note that once we have calculated a model we can also apply (15) in the other direction, i.e. for a given probability $\Omega \in [0, 1)$ we can compute the margin ε such that Ω is a lower bound on the probability of test points being ±ε accurate. Simple algebra shows that $4\,a^{\top}\Sigma a = \big(\sigma_{yy} - \sum_{j=1}^{d}\beta_j\sigma_{y x_j}\big)/\epsilon^2$. We substitute $a_1 = 1/(2\epsilon)$ and $a_{j+1} = -\beta_j a_1$ for $j = 1, \dots, d$, and factor $1/(2\epsilon)$ out of each of the two vectors (the 4 cancels with $(1/2)^2$):

$$4\,a^{\top}\Sigma a = 4\Big(\tfrac{1}{2\epsilon}\Big)^2 (1, -\beta_1, \dots, -\beta_d)\,\Sigma\,(1, -\beta_1, \dots, -\beta_d)^{\top} = \tfrac{1}{\epsilon^2}\,(1, -\beta_1, \dots, -\beta_d)\,\Sigma\,(1, -\beta_1, \dots, -\beta_d)^{\top}$$

We do the matrix-vector multiplication on the right side and note that $\sum_{j=1}^{d} \sigma_{x_i x_j}\beta_j = \sigma_{x_i y}$ ($i = 1, \dots, d$) from (13):

$$\tfrac{1}{\epsilon^2}\,(1, -\beta_1, \dots, -\beta_d)\begin{pmatrix} \sigma_{yy} - \sum_{j=1}^{d} \sigma_{y x_j}\beta_j \\ \sigma_{x_1 y} - \sigma_{x_1 y} \\ \vdots \\ \sigma_{x_d y} - \sigma_{x_d y} \end{pmatrix}$$

The last $d$ components of the column vector are zero, hence the expression simplifies to:

$$\tfrac{1}{\epsilon^2}\Big(\sigma_{yy} - \sum_{j=1}^{d} \beta_j\sigma_{y x_j}\Big)$$

The numerator $\nu = \sigma_{yy} - \sum_{j=1}^{d} \beta_j\sigma_{y x_j}$ establishes the relation between Ω and ε. It is important to note that the lower bound Ω is also valid for linear and nonlinear (see Section 4.1) regression models generated by other learning algorithms. However, for other models $\boldsymbol{\beta}$ may not satisfy (13), i.e. the last $d$ components do not zero out, and we have to compute the whole expression:

$$\nu = (1, -\beta_1, \dots, -\beta_d)\begin{pmatrix} \sigma_{yy} & \sigma_{y x_1} & \cdots & \sigma_{y x_d} \\ \sigma_{x_1 y} & \sigma_{x_1 x_1} & \cdots & \sigma_{x_1 x_d} \\ \vdots & \vdots & & \vdots \\ \sigma_{x_d y} & \sigma_{x_d x_1} & \cdots & \sigma_{x_d x_d} \end{pmatrix}\begin{pmatrix} 1 \\ -\beta_1 \\ \vdots \\ -\beta_d \end{pmatrix} \qquad (16)$$

If we set $\beta_0$ according to (14) we can, for example, compute distribution free error bounds of an SVM for any given confidence Ω. We summarize these results in the following theorem:

Theorem 3. For any linear regression model $f(\mathbf{x}) = \sum_{j=1}^{d} \beta_j x_j + \beta_0$ we have the following relationships between the distribution free lower bound probability Ω and the margin size ε:

$$\Omega = \frac{1}{\nu/\epsilon^2 + 1}, \qquad \epsilon = \sqrt{\frac{\nu\,\Omega}{1-\Omega}}$$

where ν is defined in (16).

Proof. Combining (15) and (16) we conclude that we can write Ω as

$$\Omega = \frac{1}{4\,a^{\top}\Sigma a + 1} = \frac{1}{\nu/\epsilon^2 + 1}$$

for any basis function regression model (which completes the first part of the proof). If we solve this equation for ε we obtain:

$$\Omega = \frac{1}{\nu/\epsilon^2 + 1} \iff \Big(\frac{\nu}{\epsilon^2} + 1\Big)\Omega = 1 \iff (\nu + \epsilon^2)\,\Omega = \epsilon^2 \iff \nu\Omega = \epsilon^2(1 - \Omega) \iff \epsilon = \sqrt{\frac{\nu\,\Omega}{1-\Omega}}$$
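The linear case can be summarized in a few lines of code. The sketch below (Python/NumPy; all function and variable names are ours and purely illustrative, not the paper's Matlab implementation) solves the covariance system (13), sets the offset as in (14), and uses ν from (16) to move between Ω and ε as in Theorem 3:

```python
import numpy as np

def linear_mpmr(X, y):
    """Fit the linear MPMR model, which by Theorem 1 equals least squares."""
    S = np.cov(np.column_stack([y, X]), rowvar=False)   # (d+1)x(d+1) covariance, y first as in (12)
    Sigma_xx, sigma_xy = S[1:, 1:], S[1:, 0]
    beta = np.linalg.solve(Sigma_xx, sigma_xy)           # system (13)
    beta0 = y.mean() - X.mean(axis=0) @ beta             # offset (14)
    # nu from (16); for the MPMR / least squares solution it reduces to
    # sigma_yy - sum_j beta_j * sigma_yx_j
    v = np.concatenate([[1.0], -beta])
    nu = v @ S @ v
    return beta0, beta, nu

def omega_from_eps(nu, eps):
    """Lower bound Omega for a given margin eps (Theorem 3)."""
    return 1.0 / (nu / eps**2 + 1.0)

def eps_from_omega(nu, omega):
    """Margin eps guaranteed with probability at least omega (Theorem 3)."""
    return np.sqrt(nu * omega / (1.0 - omega))

# usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ rng.normal(size=4) + 0.3 * rng.normal(size=500)
beta0, beta, nu = linear_mpmr(X, y)
print(omega_from_eps(nu, eps=0.5), eps_from_omega(nu, omega=0.9))
```

For coefficients produced by some other algorithm, the same two conversion functions apply as long as ν is computed with the full quadratic form (16) rather than the simplified numerator.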

4. Nonlinear Framework

We now extend the above linear regression formulation to nonlinear regression using the basis function formulation:

$$f(\mathbf{x}) = \sum_{i=1}^{k} \beta_i \Phi_i(\mathbf{x}) + \beta_0$$

where the basis functions $\Phi_i: \mathbb{R}^d \to \mathbb{R}$ can be arbitrary nonlinear mappings.

4.1 Kernel Maps as Features

In this paper we focus our analysis and experiments on Mercer kernels and use the kernel maps $K(\mathbf{x}_i, \cdot)$ induced by the training examples as a basis. Instead of $d$ features, each input vector is now represented by $N$ features: the kernel map evaluated at all of the training inputs (including itself).

$$f(\mathbf{x}) = \sum_{i=1}^{N} \beta_i K(\mathbf{x}_i, \mathbf{x}) + \beta_0$$

We then solve the regression problem with the linear MPMR, i.e. we minimize the following least squares expression:

$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{N} \Big(y_i - \Big(\sum_{j=1}^{N} \beta_j z_{ij} + \beta_0\Big)\Big)^2 \qquad \text{where} \quad z_{ij} = K(\mathbf{x}_i, \mathbf{x}_j), \quad i = 1, \dots, N;\; j = 1, \dots, N$$

A new input $\mathbf{x}$ is then evaluated as:

$$\hat{y} = \hat{f}(\mathbf{x}) = \sum_{i=1}^{N} \beta_i K(\mathbf{x}_i, \mathbf{x}) + \beta_0$$

Commonly used kernel functions include the linear kernel $K(\mathbf{s}, \mathbf{t}) = \mathbf{s}^{\top}\mathbf{t}$, the polynomial kernel $K(\mathbf{s}, \mathbf{t}) = (\mathbf{s}^{\top}\mathbf{t} + 1)^p$ and the Gaussian kernel $K(\mathbf{s}, \mathbf{t}) = \exp\big(-(\mathbf{s}-\mathbf{t})^{\top}(\mathbf{s}-\mathbf{t})/\sigma\big)$ (see for example Schölkopf and Smola, 2002). Note that the learning algorithm now has to solve an N-by-N system instead of a d-by-d system:

$$\begin{pmatrix} \sigma_{z_1 z_1} & \sigma_{z_1 z_2} & \cdots & \sigma_{z_1 z_N} \\ \sigma_{z_2 z_1} & \sigma_{z_2 z_2} & \cdots & \sigma_{z_2 z_N} \\ \vdots & \vdots & & \vdots \\ \sigma_{z_N z_1} & \sigma_{z_N z_2} & \cdots & \sigma_{z_N z_N} \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_N \end{pmatrix} = \begin{pmatrix} \sigma_{z_1 y} \\ \sigma_{z_2 y} \\ \vdots \\ \sigma_{z_N y} \end{pmatrix}$$

and setting

$$\beta_0 = \bar{y} - \sum_{j=1}^{N} \beta_j \bar{z}_j$$

where

$$\bar{y} = \tfrac{1}{N}\sum_{i=1}^{N} y_i, \qquad \bar{z}_j = \tfrac{1}{N}\sum_{i=1}^{N} z_{ij}, \qquad \sigma_{z_j y} = \tfrac{1}{N-1}\sum_{i=1}^{N} (z_{ij} - \bar{z}_j)(y_i - \bar{y}) = \sigma_{y z_j}, \qquad \sigma_{z_j z_k} = \tfrac{1}{N-1}\sum_{i=1}^{N} (z_{ij} - \bar{z}_j)(z_{ik} - \bar{z}_k)$$

for $j, k = 1, \dots, N$.
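A minimal sketch of this kernelized procedure follows (Python/NumPy; names are ours, and we assume the Gaussian kernel parameterization $\exp(-\lVert\mathbf{s}-\mathbf{t}\rVert^2/\sigma)$ used in the experiments). Because the sample covariance of $N$ kernel features computed from $N$ examples has rank at most $N-1$, the sketch uses a least squares solve for the N-by-N system:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """K(s, t) = exp(-||s - t||^2 / sigma), the parameterization used in the experiments."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma)

def kernel_mpmr(X, y, sigma):
    """Kernel MPMR: run the linear MPMR on the N kernel features z_ij = K(x_i, x_j)."""
    Z = gaussian_kernel(X, X, sigma)                   # N x N feature matrix
    S = np.cov(np.column_stack([y, Z]), rowvar=False)  # (N+1) x (N+1) covariance, y first
    # N-by-N system of Section 4.1; the feature covariance is rank deficient,
    # so a least squares solve is used instead of a direct solve
    beta = np.linalg.lstsq(S[1:, 1:], S[1:, 0], rcond=None)[0]
    beta0 = y.mean() - Z.mean(axis=0) @ beta
    v = np.concatenate([[1.0], -beta])
    nu = v @ S @ v                                     # nu as in Theorem 4

    def predict(X_new):
        return gaussian_kernel(X_new, X, sigma) @ beta + beta0

    return predict, nu

# usage: fit the sinc toy problem and report the bound Omega for eps = 0.2
rng = np.random.default_rng(2)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=100)
predict, nu = kernel_mpmr(X, y, sigma=0.3 * X.shape[1])
print("Omega(eps=0.2) =", 1.0 / (nu / 0.2**2 + 1.0))
```

Any other set of basis functions $\Phi_i$ could be substituted for the kernel columns without changing the rest of the procedure.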

The proof of Theorem 3 generalizes in a straightforward way to the basis function framework:

Theorem 4. For any basis function regression model $f(\mathbf{x}) = \sum_{i=1}^{N} \beta_i\Phi_i(\mathbf{x}) + \beta_0$ we have the following relationships between the distribution free lower bound probability Ω and the margin size ε:

$$\Omega = \frac{1}{\nu/\epsilon^2 + 1}, \qquad \epsilon = \sqrt{\frac{\nu\,\Omega}{1-\Omega}} \qquad (17)$$

where

$$\nu = (1, -\beta_1, \dots, -\beta_N)\begin{pmatrix} \sigma_{yy} & \sigma_{y z_1} & \cdots & \sigma_{y z_N} \\ \sigma_{z_1 y} & \sigma_{z_1 z_1} & \cdots & \sigma_{z_1 z_N} \\ \vdots & \vdots & & \vdots \\ \sigma_{z_N y} & \sigma_{z_N z_1} & \cdots & \sigma_{z_N z_N} \end{pmatrix}\begin{pmatrix} 1 \\ -\beta_1 \\ \vdots \\ -\beta_N \end{pmatrix}$$

Proof. Analogous to the proof of Theorem 3.

4.2 A Different Approach

In previous work we proposed a different nonlinear MPMR formulation, which is based on the nonlinear MPM for classification by Lanckriet et al. (2002a). Instead of using kernel maps as features, this formulation finds a minimax hyperplane in a higher dimensional space. As with kernelized support vector machines, one considers the (implicit) mapping $\varphi: \mathbb{R}^d \to \mathbb{R}^h$ and its dot products $\varphi(\mathbf{x}_i)^{\top}\varphi(\mathbf{x}_j) =: K(\mathbf{x}_i, \mathbf{x}_j)$. This MPMR algorithm (we call it kernel-MPMR since it requires Mercer kernels) is also obtained by using the MPM classifier on an artificial data set where the dependent variable, shifted by ±ε, is added as the first component of the input vector (yielding the $(d+1)$-dimensional vector $(y, x_1, \dots, x_d)$). The kernel-MPMR model looks like:

$$\sum_{i=1}^{2N} \gamma_i K(\mathbf{z}_i, \mathbf{z}) + \gamma_0 = 0 \qquad (18)$$

where $\mathbf{z}_i = (y_i + \epsilon, x_{i1}, \dots, x_{id})$ for $i = 1, \dots, N$ and $\mathbf{z}_i = (y_{i-N} - \epsilon, x_{i-N,1}, \dots, x_{i-N,d})$ for $i = N+1, \dots, 2N$. Evaluating a new point $\mathbf{x}$ amounts to plugging $\mathbf{z} = (y, \mathbf{x})$ into (18) and solving the equation for $y$. For an arbitrary kernel this would imply that we have to deal with a nonlinear equation, which is at best computationally expensive and at worst not solvable at all. Therefore, the kernel is restricted to be output-linear, i.e. we require:

$$K\big((s_1, s_2, \dots, s_{d+1}), (t_1, t_2, \dots, t_{d+1})\big) = s_1 t_1 + K'\big((s_2, \dots, s_{d+1}), (t_2, \dots, t_{d+1})\big) \qquad (19)$$

The theory of positive definite kernels (see for example Schölkopf and Smola, 2002) states that positive definite kernels are closed under summation, which means that we could use

an arbitrary positive definite kernel for $K'$ and obtain a valid kernel $K$ (of course the linear output kernel is also positive definite). By using a kernel $K$ of the form (19), we can analytically solve (18) for $y$ and obtain the regression model:

$$y = -\frac{1}{2\epsilon}\left[\Big(\sum_{i=1}^{N}(\gamma_i + \gamma_{i+N})\,K'(\mathbf{x}_i, \mathbf{x})\Big) + \gamma_0\right]$$

If we compare the two MPMR algorithms, we note that the computational cost of learning a kernel-MPMR model is larger, because instead of an N×N system we have to solve a 2N×2N system to obtain the coefficients γ. For MPMR we are not restricted to Mercer kernels but can use any set $\Phi_i$ of nonlinear basis functions. It is an open question whether MPMR and kernel-MPMR are equivalent when Mercer kernels are used for the MPMR.

5. Robust Estimates

The MPMR learning algorithm estimates the covariance matrix from the training data. Of course, these statistical estimates will differ from the true values, and thus the regression surface itself and the lower bound probability Ω may be inaccurate. Especially in the nonlinear case the problem of overfitting the training data with highly sensitive kernel functions arises, so our goal is to find kernels that produce robust estimates. One way of dealing with the problem of selecting the right kernel (i.e. model) is to do cross validation on the training data and pick the kernel which yields the best result. In this paper we explore a different approach which is computationally much more efficient. The idea is to randomly select a fraction α of rows that will be left out when calculating the covariance matrix (note that we measure the covariance columnwise, i.e. deleting rows does not affect the dimension of the covariance matrix). If the covariance matrix of the remaining training data is reasonably close to the covariance matrix of all the training data, we consider our model to be robust.

We now define in mathematical terms what reasonably close means. The objective of the MPMR algorithm is to maximize the lower bound probability Ω, which is determined by the term $a^{\top}\Sigma a$. Therefore, we measure the sensitivity of an MPMR model by looking at the sensitivity of $a^{\top}\Sigma a$. We call a model α-robust if the relative error of this expression is at most α when leaving a fraction α of points out for computing the covariance matrix:

Definition 1.
$\Sigma^*$ — true covariance matrix; $a^*$ minimizes $a^{\top}\Sigma^* a$ s.t. $a_1 = 1/(2\epsilon)$
$\Sigma$ — covariance matrix of all $N$ examples; $a$ minimizes $a^{\top}\Sigma a$ s.t. $a_1 = 1/(2\epsilon)$
$\Sigma_\alpha$ — covariance matrix of $(1-\alpha)N$ examples; $a_\alpha$ minimizes $a_\alpha^{\top}\Sigma_\alpha a_\alpha$ s.t. $a_{\alpha,1} = 1/(2\epsilon)$

A model is α-robust $:\iff$

$$\frac{|a_\alpha^{\top}\Sigma^* a_\alpha - a^{\top}\Sigma^* a|}{a_\alpha^{\top}\Sigma^* a_\alpha} \le \alpha$$

We cannot apply this definition directly since we do not know the true covariance matrix $\Sigma^*$, but we will prove that under the following two assumptions we can obtain a formula that

does not involve $\Sigma^*$:

1) $a^{\top}\Sigma^* a \le a_\alpha^{\top}\Sigma^* a_\alpha$ (the model built from all $N$ examples is more accurate than the $(1-\alpha)N$ model)
2) $|a^{\top}\Sigma a - a_\alpha^{\top}\Sigma a_\alpha| \le |a^{\top}\Sigma^* a - a^{*\top}\Sigma^* a^*|$ (for small α, $a_\alpha$ is closer to $a$ than $a$ is to $a^*$)

Under these assumptions the following theorem holds:

Theorem 5. If a model is α-robust, then the ratio $\alpha : \big(|a_\alpha^{\top}\Sigma a_\alpha - a^{\top}\Sigma a| / a_\alpha^{\top}\Sigma a_\alpha\big)$ is at least 1. In formulas:

$$\frac{|a_\alpha^{\top}\Sigma^* a_\alpha - a^{\top}\Sigma^* a|}{a_\alpha^{\top}\Sigma^* a_\alpha} \le \alpha \;\Longrightarrow\; \alpha : \frac{|a_\alpha^{\top}\Sigma a_\alpha - a^{\top}\Sigma a|}{a_\alpha^{\top}\Sigma a_\alpha} \ge 1 \qquad (20)$$

Proof. Writing out assumptions 1) and 2) together with the premise and combining the resulting inequalities yields $|a_\alpha^{\top}\Sigma a_\alpha - a^{\top}\Sigma a| \le \alpha\, a_\alpha^{\top}\Sigma a_\alpha$, which is equivalent to the stated ratio being at least 1.

We refer to the number

$$R = \alpha : \frac{|a_\alpha^{\top}\Sigma a_\alpha - a^{\top}\Sigma a|}{a_\alpha^{\top}\Sigma a_\alpha}$$

as the robustness measure or confidence measure. In our experiments we apply the theorem to show that a model is not α-robust: a model for which we find $R < 1$ will be rejected because of too much sensitivity in Ω.

A different approach for constructing a robust minimax probability machine has been investigated by Lanckriet et al. (2002b) for the task of classification. There the main idea is to put error bounds on the plug-in estimates of mean and covariance. Instead of a single mean, the robust classifier considers an ellipsoid of means, centered around the estimated mean and bounded in size by an error bound parameter ν. Similarly, instead of a single covariance matrix one now has a matrix ball, where each matrix is at most ρ away (measured in Frobenius norm) from the estimated covariance matrix. When a robust model is calculated, the lower bound probability Ω (for correct classification) decreases when either ν or ρ increases. The model itself is independent of ν, but it can depend on ρ, and a small toy example (see Lanckriet et al., 2002b, p. 578) shows that a robust model (ρ > 0) can yield better performance than a simple one (where ρ = 0). From a theoretical point of view the two approaches differ in that our method is output oriented (it considers the sensitivity of Ω, which is the objective that MPMR maximizes), whereas the other method is more input oriented (it considers the sensitivity of the plug-in estimates derived from the training data). Both approaches rely on sampling methods to make robust estimates: we sample a fraction α of the examples to determine the sensitivity of Ω; Lanckriet et al. (2002b) use sampling methods to determine reasonable values for the parameters ν and ρ. Our method, however, has the advantage that it provides a straightforward way to do model selection: choose the model that has a maximal Ω while satisfying $R \ge 1$.
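A minimal sketch of the robustness computation follows (Python/NumPy). The function names are ours, and the particular placement of Σ versus $\Sigma_\alpha$ in the ratio follows our reading of Definition 1 and Theorem 5, so this should be treated as an illustration of the idea (discard a fraction α of rows, re-estimate the covariance, and compare the resulting quadratic forms) rather than as the reference implementation:

```python
import numpy as np

def quadratic_form_min(S, eps):
    """Return the a minimizing a^T S a subject to a_1 = 1/(2*eps)."""
    beta = np.linalg.lstsq(S[1:, 1:], S[1:, 0], rcond=None)[0]
    return np.concatenate([[1.0], -beta]) / (2.0 * eps)

def robustness(Z, y, eps, alpha=0.1, rng=None):
    """Confidence measure R = alpha / (|a_a^T S a_a - a^T S a| / a_a^T S a_a).

    Z holds the (possibly kernelized) features; a fraction alpha of the rows
    is discarded when estimating the second covariance matrix."""
    rng = np.random.default_rng(0) if rng is None else rng
    D = np.column_stack([y, Z])
    S = np.cov(D, rowvar=False)                  # covariance from all N examples
    keep = rng.permutation(len(D))[int(alpha * len(D)):]
    S_alpha = np.cov(D[keep], rowvar=False)      # covariance from (1-alpha)N examples
    a = quadratic_form_min(S, eps)               # minimizer under the full sample
    a_alpha = quadratic_form_min(S_alpha, eps)   # minimizer under the reduced sample
    q, q_alpha = a @ S @ a, a_alpha @ S @ a_alpha
    return alpha / (abs(q_alpha - q) / q_alpha)  # reject the model if R < 1
```

In a model selection loop one would compute R for each candidate kernel parameter and keep only candidates with R at least 1, picking among them the one with the largest Ω.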

It is not immediately clear how a strategy for model selection could be implemented with the approach of Lanckriet et al. (2002b). It is an open question for which practical applications either approach is preferable.

6. Experimental Results

In this section we present the results of the robust MPMR on toy problems and real world regression benchmarks. For all of the data sets we used a Gaussian kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2/\sigma)$ to obtain a nonlinear regression surface. We plot the behavior of Ω and R for a range of kernel parameters σ centered around $\sigma_0 = 0.3\,d$, where $d$ is the input dimension of the data set (the interested reader is referred to Schölkopf et al., 1997, and Schölkopf et al., 1999, for a discussion of this particular choice for σ); at each step, the parameter σ is increased by a constant factor. In all our experiments the same fixed fraction α of points was left out for the estimation of R. To quantify how accurate our lower bound estimate Ω is, we compute the fraction of test points that are predicted ±ε correctly ($\Omega_T$) for each test run. The common behavior on all data sets is that small values of σ (which yield bad models due to overfitting) have a tiny value of R, i.e. the corresponding model will be rejected. As σ increases, we obtain more robust estimates of the covariance matrix and therefore more robust MPMR models that will not be rejected. The point of interest is where the robustness measure R crosses 1, because this is the place where we become confident in our model and our Ω bound.

6.1 Toy Example

We used the sinc function $x \mapsto \sin(\pi x)/(\pi x)$ as a toy example for our learning algorithm. A noise term drawn from a zero mean Gaussian distribution was added to the examples. Training and test sets of equal size were used (see Figure 3a). We tried a range of values for σ and averaged over several test runs for each σ. For both a fixed Ω and a fixed ε we observed that Ω became a true lower bound on $\Omega_T$ right at the point where our confidence measure R crossed 1 (see Figures 2a,b). This toy experiment shows that the robustness ratio R is suitable for measuring the robustness of Ω estimates.

6.1.1 Comparison to Gaussian Processes

In this section we also use the sinc function as toy data but add a noise variable ρ that is not drawn from a (single) Gaussian distribution. Instead, we generate the noise from two separate Gaussians: with probability p = 0.5, ρ is drawn from a Gaussian with mean µ, and with probability p = 0.5 from a Gaussian with mean −µ (see Figure 3b). We tested a range of values for µ, again with training and test sets of equal size. The Gaussian process implementation we used is provided by Gibbs (1997b). We calculated the error bounds of the Gaussian process for one standard deviation; in the MPMR framework this corresponds to a value of Ω ≈ 0.68. By using (17) we computed the MPMR error bounds for each distribution.

[Figure 2: Toy example. (a) For Ω = const and R > 1 we have Ω < Ω_T. (b) For ε = const and R > 1 we have Ω < Ω_T.]

[Figure 3: (a) The sinc function with Gaussian noise: training data, the sinc function, and the MPMR model ±ε. (b) Density of the two-Gaussian noise for one value of µ.]

As expected, the Gaussian process error bounds ε_GP were tighter (i.e. more optimistic) than the worst case MPMR error bounds ε_MPMR (see Figure 4a). For small values of µ, where the noise distribution is still close to a (single) Gaussian, the Gaussian process error bounds are still accurate, i.e. they match the measured percentage of ±ε accurate predictions. As µ increases and the noise distribution becomes less Gaussian, the Gaussian process error bounds are no longer a good estimate of the prediction accuracy. The MPMR error bounds, however, are valid for all tested values of µ (see Figure 4b).

[Figure 4: Comparison of MPMR with Gaussian processes. (a) Error bounds ε_MPMR and ε_GP as a function of µ. (b) Fraction of ±ε accurate test points for MPMR and GP as a function of µ.]

6.2 Boston Housing

The first real world data set we experimented with is the Boston Housing data, which consists of N = 506 examples, each with d = 13 real valued inputs and one real valued output. We divided the data randomly into a training set of size N_train = 481 and a test set of size N_test = 25. We tried 75 different values for σ, and for each of these we averaged our results over several runs. For the first plot (Figure 5a) we fixed Ω = 0.5 and calculated the corresponding ε for the model. Additionally, we plotted the confidence measure R and the fraction Ω_T of test points within ±ε of our regression surface. As soon as R became bigger than 1, we noticed that Ω was a true lower bound, i.e. Ω < Ω_T. The second plot (Figure 5b) has a fixed ε, which was set to the value of ε where R became greater than 1 in the first plot. We recomputed Ω and Ω_T and saw as before that Ω < Ω_T for R > 1. It is also remarkable that the best regression model (i.e. the one that maximizes Ω_T for a given ε) occurs right at the point where R crosses 1. The next plot (Figure 5c) shows the percentage of bad models (where Ω > Ω_T), which is small for R > 1 and approaches zero as R increases. This shows that the more robust a model is in terms of R, the more unlikely it is for Ω not to be a true lower bound on the probability of ±ε accurate prediction. The fourth plot (Figure 5d) shows false positives and false negatives for the robustness measure itself. Note that we have a high false negative rate shortly before R crosses 1. The reason for this effect is that a model with, say, R = 0.8 will be rejected, but Ω < Ω_T may still hold, because Ω is a lower bound for all possible distributions, i.e. for specific distributions Ω is in general a pessimistic estimate. False positives are rare: when we accept a model because R > 1, it is very unlikely (< 5%) that the model will be inaccurate in the sense that Ω_T is smaller than Ω. Our last plot (Figure 5e) shows the connection of our regression approach to the more common one which minimizes the (normed) mean squared error. We found that the normed MSE (MSE_model / MSE_mean) has its minimum right at the point where R crosses 1 and Ω_T has its maximum in the second plot.

[Figure 5: Boston Housing results. (a) For R > 1 we have Ω_T > Ω, i.e. Ω is a lower bound. (b) For ε = const the maximal Ω_T occurs around R = 1. (c) For R > 1 the number of bad models is small. (d) Many false negatives (Ω is a worst case bound), few false positives. (e) The minimum normed MSE corresponds to the maximum Ω_T.]

6.3 Abalone

The Abalone data set contains N = 4177 examples, each with 8 inputs and one real valued output. One of the input features is not a real number but takes one of three distinct values. We split this discrete valued feature into 3 real valued features, setting one component to 1 and the other two to 0, which increases the input dimension to d = 10. The data was then divided randomly into N_train = 1000 training examples and N_test = 3177 test examples. As in the Boston experiment, we went over a range of kernel parameters and averaged the results over a number of test runs. Again, we plotted ε for a constant Ω, Ω for a constant ε, the fraction of bad models, false positives/negatives for R, and the normed mean squared error (see Figure 6). The overall behavior was the same as for the Boston Housing data: models with a confidence measure R > 1 yielded accurate lower bounds Ω. The false positive and false negative rates were more extreme for this data set: we did not get any false positives, but the false negative percentage was very high for R slightly less than 1. Also, models with a very insensitive kernel parameter tended to be about as good as models with a kernel parameter in the intermediate range. We ascribe this effect to the high noise level of the Abalone data set, which puts a bias towards insensitive models.

[Figure 6: Abalone results; panels (a)-(e) as in Figure 5. No false positives were observed.]

6.4 Kinematic Data

The KIN8NM data set consists of N = 8192 examples, each with 8 real valued inputs and one real valued output. We used the same setup as for the Abalone data set: N_train = 1000 training examples, a range of kernel parameters, and several test runs to average the results for each kernel parameter. As in the previous experiments, the robustness measure R provided a good estimate of the point where Ω becomes a lower bound on Ω_T. For this data set, however, the maximum Ω_T and the minimum MSE did not occur at the point where R crossed 1, but at a point where R was still less than one (see Figure 7).

7. Conclusion & Future Work

We have presented the MPMR framework as a new approach to regression problems in which models are represented as nonlinear (or linear) basis function expansions (see (2) and (3)). The objective of MPMR is to find a model that maximizes the worst case probability that future data is predicted accurately within some ±ε bound. We also introduced a general approach for reducing any regression problem to a binary classification problem by shifting the dependent variable both by +ε and by −ε. In this paper, we used MPMC as the underlying classifier, which gives us a distribution free probability bound Ω and therefore a regression algorithm that can estimate its own ability to make ±ε correct predictions. Remarkably, the resulting MPMR is equivalent to the method of least squares, a standard technique in statistical analysis. One consequence of our work is that it is now possible to calculate distribution free confidence intervals for least squares solutions (formerly this was only feasible under Gaussian or other specific, often unrealistic, distributional assumptions).

[Figure 7: Kinematic data results; panels (a)-(e) as in Figure 5.]

For nonlinear problems we used Mercer kernels as basis functions for our regression model; however, our formulation is not limited to Mercer kernels but works with arbitrary basis functions. Typically, there are kernel parameters (e.g. the width σ of the Gaussian kernel) that need to be adjusted properly to obtain good models. A common technique for finding these parameters is cross validation, which is computationally expensive. Our approach, in contrast, does not involve cross validation but is based on the robustness measure R. R is calculated by sampling parts of the training data and observing the effect on Ω, the objective of the MPMR. This is computationally more efficient, since N-fold cross validation requires building N models whereas we only need to build two. Experiments with real world data sets have shown that the robustness measure R works well for filtering out models that overfit the data and consequently violate the property of Ω as a true lower bound. Empirical evidence shows that R can be used to find optimal kernel parameters and is therefore suitable for model selection. Future research on MPMR will focus on investigating the possibility of doing model selection solely based on R, without any cross validation. As a consequence of Theorem 4 (Section 4.1), we are also looking into calculating lower bounds Ω for basis function regression models produced by other algorithms (e.g. the support vector machine regression described in Smola and Schölkopf, 1998). In combination with the robustness measure R (and possibly a model selection criterion) this idea could help improve the accuracy of a whole class of regression algorithms. Also, the current implementation requires solving an N×N linear system of equations (complexity O(N³)) and is therefore prohibitive for large N (where N is the number of training examples).

As a result we are currently working on a Sparse MPMR. Finally, we note that for N → ∞ MPMR converges to the true Ω (and the optimal model), since the estimates of the mean and covariance matrix become more and more accurate as the number of training examples N increases. We are currently looking into the rate of convergence of Ω as N increases, which should be directly related to the rate of convergence of the mean and covariance estimates. A Matlab implementation of the MPMR algorithm can be obtained from: strohman/papers.html

8. Acknowledgements

We are grateful to Mark Gibbs for providing an implementation for regression estimation with Gaussian processes.

9. Appendix A

The least squares problem can be formulated as the following linear system (see for example Wackerly et al., 2002):

$$X^{\top}X\boldsymbol{\beta} = X^{\top}Y \qquad (21)$$

where

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1d} \\ 1 & x_{21} & \cdots & x_{2d} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix}$$

If we write out (21) we obtain (from now on all sums run from i = 1 to N):

$$\begin{pmatrix} N & \sum x_{i1} & \cdots & \sum x_{id} \\ \sum x_{i1} & \sum x_{i1}^2 & \cdots & \sum x_{i1}x_{id} \\ \vdots & \vdots & & \vdots \\ \sum x_{id} & \sum x_{id}x_{i1} & \cdots & \sum x_{id}^2 \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_{i1}y_i \\ \vdots \\ \sum x_{id}y_i \end{pmatrix}$$

Now we write all the sums in terms of means and covariances:

$$\sum y_i = N\bar{y}, \qquad \sum x_{ij} = N\bar{x}_j, \qquad \sum x_{ij}y_i = (N-1)\sigma_{x_j y} + N\bar{x}_j\bar{y}, \qquad \sum x_{ij}x_{ik} = (N-1)\sigma_{x_j x_k} + N\bar{x}_j\bar{x}_k$$

for $j, k = 1, \dots, d$, where the last two identities are obtained by noting that:

$$\sigma_{x_j x_k} = \tfrac{1}{N-1}\sum (x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k) \iff (N-1)\,\sigma_{x_j x_k} = \sum \big(x_{ij}x_{ik} - x_{ij}\bar{x}_k - x_{ik}\bar{x}_j + \bar{x}_j\bar{x}_k\big) = \sum x_{ij}x_{ik} - N\bar{x}_j\bar{x}_k$$

If we plug these expressions into the linear system above we have:

$$\begin{pmatrix} N & N\bar{x}_1 & \cdots & N\bar{x}_d \\ N\bar{x}_1 & (N-1)\sigma_{x_1 x_1}+N\bar{x}_1\bar{x}_1 & \cdots & (N-1)\sigma_{x_1 x_d}+N\bar{x}_1\bar{x}_d \\ \vdots & \vdots & & \vdots \\ N\bar{x}_d & (N-1)\sigma_{x_d x_1}+N\bar{x}_d\bar{x}_1 & \cdots & (N-1)\sigma_{x_d x_d}+N\bar{x}_d\bar{x}_d \end{pmatrix}\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{pmatrix} = \begin{pmatrix} N\bar{y} \\ (N-1)\sigma_{x_1 y}+N\bar{x}_1\bar{y} \\ \vdots \\ (N-1)\sigma_{x_d y}+N\bar{x}_d\bar{y} \end{pmatrix}$$

If we look at the first equation we retrieve the familiar formula for $\beta_0$:

$$N\beta_0 + \sum_{j=1}^{d} N\bar{x}_j\beta_j = N\bar{y} \iff \beta_0 = \bar{y} - \sum_{j=1}^{d} \beta_j\bar{x}_j$$

Now look at the $(j+1)$-th equation ($j = 1, \dots, d$), which has $\bar{x}_j$ as the common factor:

$$N\bar{x}_j\beta_0 + \sum_{k=1}^{d} \beta_k\big[(N-1)\sigma_{x_j x_k} + N\bar{x}_j\bar{x}_k\big] = (N-1)\sigma_{x_j y} + N\bar{x}_j\bar{y}$$

Substituting $\beta_0 = \bar{y} - \sum_k\beta_k\bar{x}_k$ gives

$$N\bar{x}_j\Big(\bar{y} - \sum_{k=1}^{d}\beta_k\bar{x}_k\Big) + N\sum_{k=1}^{d}\beta_k\bar{x}_j\bar{x}_k + (N-1)\sum_{k=1}^{d}\beta_k\sigma_{x_j x_k} = (N-1)\sigma_{x_j y} + N\bar{x}_j\bar{y}$$

$$\iff N\bar{x}_j\bar{y} + (N-1)\sum_{k=1}^{d}\beta_k\sigma_{x_j x_k} = (N-1)\sigma_{x_j y} + N\bar{x}_j\bar{y} \iff \sum_{k=1}^{d}\beta_k\sigma_{x_j x_k} = \sigma_{x_j y}$$

The last expression matches exactly the $j$-th equation of (13). Since this holds for all $j = 1, \dots, d$, we have shown that (13) and (14) describe the same system that appears in least squares problems.
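As a quick numerical sanity check of Appendix A (a sketch under our notation; the data and names are invented for illustration), one can verify that the covariance form (13)-(14) and the normal equations (21) produce the same coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + rng.normal(size=N)

# normal equations (21): X^T X beta = X^T Y, with a column of ones for the offset
Xa = np.column_stack([np.ones(N), X])
beta_ls = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

# covariance form (13)-(14)
S = np.cov(np.column_stack([y, X]), rowvar=False)
beta = np.linalg.solve(S[1:, 1:], S[1:, 0])
beta0 = y.mean() - X.mean(axis=0) @ beta

print(np.allclose(beta_ls, np.concatenate([[beta0], beta])))   # True
```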

References

M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.

M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997a.

M. N. Gibbs. Tpros software: Regression using Gaussian processes, 1997b.

G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. Minimax probability machine. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002a. MIT Press.

G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555-582, 2002b.

D. MacKay. Gaussian processes: A replacement for supervised neural networks?, 1997. Lecture notes for a tutorial at NIPS.

A. W. Marshall and I. Olkin. Multivariate Chebyshev inequalities. Annals of Mathematical Statistics, 31(4):1001-1014, 1960.

I. Popescu and D. Bertsimas. Optimal inequalities in probability theory: A convex optimization approach. Technical Report TM62, INSEAD, Dept. Math. O.R., Cambridge, Mass., 2001.

C. E. Rasmussen. Evaluation of Gaussian Processes and other Methods for Non-Linear Regression. PhD thesis, University of Toronto, 1996.

S. J. Raudys and A. K. Jain. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3):252-264, 1991.

B. Schölkopf, P. Bartlett, A. Smola, and R. Williamson. Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, Cambridge, MA, 1999. MIT Press.

B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

B. Schölkopf, K.-K. Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758-2765, 1997.

A. Smola and B. Schölkopf. A tutorial on support vector regression. Technical Report NC2-TR-1998-030, Royal Holloway College, 1998.

T. R. Strohmann and G. Z. Grudic. A formulation for minimax probability machine regression. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.

M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. MIT Press.

D. Wackerly, W. Mendenhall, and R. Scheaffer. Mathematical Statistics with Applications, Sixth Edition. Duxbury Press, Pacific Grove, CA, 2002.

C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, Cambridge, MA, 1995. MIT Press.


More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Review: Support vector machines. Machine learning techniques and image analysis

Review: Support vector machines. Machine learning techniques and image analysis Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization

More information

Support Vector Machine & Its Applications

Support Vector Machine & Its Applications Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Support Vector Machine (SVM) Hamid R. Rabiee Hadi Asheri, Jafar Muhammadi, Nima Pourdamghani Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Introduction

More information

A Tutorial on Support Vector Machine

A Tutorial on Support Vector Machine A Tutorial on School of Computing National University of Singapore Contents Theory on Using with Other s Contents Transforming Theory on Using with Other s What is a classifier? A function that maps instances

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Robust Novelty Detection with Single Class MPM

Robust Novelty Detection with Single Class MPM Robust Novelty Detection with Single Class MPM Gert R.G. Lanckriet EECS, U.C. Berkeley gert@eecs.berkeley.edu Laurent El Ghaoui EECS, U.C. Berkeley elghaoui@eecs.berkeley.edu Michael I. Jordan Computer

More information

CS798: Selected topics in Machine Learning

CS798: Selected topics in Machine Learning CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Ammon Washburn University of Arizona September 25, 2015 1 / 28 Introduction We will begin with basic Support Vector Machines (SVMs)

More information

SMO Algorithms for Support Vector Machines without Bias Term

SMO Algorithms for Support Vector Machines without Bias Term Institute of Automatic Control Laboratory for Control Systems and Process Automation Prof. Dr.-Ing. Dr. h. c. Rolf Isermann SMO Algorithms for Support Vector Machines without Bias Term Michael Vogt, 18-Jul-2002

More information

SUPPORT VECTOR MACHINE FOR THE SIMULTANEOUS APPROXIMATION OF A FUNCTION AND ITS DERIVATIVE

SUPPORT VECTOR MACHINE FOR THE SIMULTANEOUS APPROXIMATION OF A FUNCTION AND ITS DERIVATIVE SUPPORT VECTOR MACHINE FOR THE SIMULTANEOUS APPROXIMATION OF A FUNCTION AND ITS DERIVATIVE M. Lázaro 1, I. Santamaría 2, F. Pérez-Cruz 1, A. Artés-Rodríguez 1 1 Departamento de Teoría de la Señal y Comunicaciones

More information

Lecture 10: A brief introduction to Support Vector Machine

Lecture 10: A brief introduction to Support Vector Machine Lecture 10: A brief introduction to Support Vector Machine Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Semi-Supervised Learning through Principal Directions Estimation

Semi-Supervised Learning through Principal Directions Estimation Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de

More information

A Robust Minimax Approach to Classification

A Robust Minimax Approach to Classification A Robust Minimax Approach to Classification Gert R.G. Lanckriet Laurent El Ghaoui Chiranjib Bhattacharyya Michael I. Jordan Department of Electrical Engineering and Computer Science and Department of Statistics

More information

Introduction to SVM and RVM

Introduction to SVM and RVM Introduction to SVM and RVM Machine Learning Seminar HUS HVL UIB Yushu Li, UIB Overview Support vector machine SVM First introduced by Vapnik, et al. 1992 Several literature and wide applications Relevance

More information

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning. Lecture 6: Support Vector Machine. Feng Li. Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Robust Fisher Discriminant Analysis

Robust Fisher Discriminant Analysis Robust Fisher Discriminant Analysis Seung-Jean Kim Alessandro Magnani Stephen P. Boyd Information Systems Laboratory Electrical Engineering Department, Stanford University Stanford, CA 94305-9510 sjkim@stanford.edu

More information

Linear Classification and SVM. Dr. Xin Zhang

Linear Classification and SVM. Dr. Xin Zhang Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lessons 6 10 Jan 2017 Outline Perceptrons and Support Vector machines Notation... 2 Perceptrons... 3 History...3

More information

Analysis of Multiclass Support Vector Machines

Analysis of Multiclass Support Vector Machines Analysis of Multiclass Support Vector Machines Shigeo Abe Graduate School of Science and Technology Kobe University Kobe, Japan abe@eedept.kobe-u.ac.jp Abstract Since support vector machines for pattern

More information

Support Vector Machines

Support Vector Machines EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Support Vector Machines and Kernel Algorithms

Support Vector Machines and Kernel Algorithms Support Vector Machines and Kernel Algorithms Bernhard Schölkopf Max-Planck-Institut für biologische Kybernetik 72076 Tübingen, Germany Bernhard.Schoelkopf@tuebingen.mpg.de Alex Smola RSISE, Australian

More information

Support Vector Machines II. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Support Vector Machines II. CAP 5610: Machine Learning Instructor: Guo-Jun QI Support Vector Machines II CAP 5610: Machine Learning Instructor: Guo-Jun QI 1 Outline Linear SVM hard margin Linear SVM soft margin Non-linear SVM Application Linear Support Vector Machine An optimization

More information

Nonlinear Dimensionality Reduction

Nonlinear Dimensionality Reduction Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Kernel PCA 2 Isomap 3 Locally Linear Embedding 4 Laplacian Eigenmap

More information

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector

More information

below, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing

below, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing Kernel PCA Pattern Reconstruction via Approximate Pre-Images Bernhard Scholkopf, Sebastian Mika, Alex Smola, Gunnar Ratsch, & Klaus-Robert Muller GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany fbs,

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers Computational Methods for Data Analysis Massimo Poesio SUPPORT VECTOR MACHINES Support Vector Machines Linear classifiers 1 Linear Classifiers denotes +1 denotes -1 w x + b>0 f(x,w,b) = sign(w x + b) How

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

Statistical Methods for SVM

Statistical Methods for SVM Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION

SVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION International Journal of Pure and Applied Mathematics Volume 87 No. 6 2013, 741-750 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v87i6.2

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

A Robust Minimax Approach to Classification

A Robust Minimax Approach to Classification A Robust Minimax Approach to Classification Gert R.G. Lanckriet gert@cs.berkeley.edu Department of Electrical Engineering and Computer Science University of California, Berkeley, CA 94720, USA Laurent

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil

A note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 68 November 999 C.B.C.L

More information

Support Vector Machine Regression for Volatile Stock Market Prediction

Support Vector Machine Regression for Volatile Stock Market Prediction Support Vector Machine Regression for Volatile Stock Market Prediction Haiqin Yang, Laiwan Chan, and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machines and Kernel Methods Geoff Gordon ggordon@cs.cmu.edu July 10, 2003 Overview Why do people care about SVMs? Classification problems SVMs often produce good results over a wide range

More information