NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition


Nonlinear Classification and Regression: Outline
Multi-Layer Perceptrons; The Back-Propagation Learning Algorithm; Generalized Linear Models; Radial Basis Function Networks; Sparse Kernel Machines; Nonlinear SVMs and the Kernel Trick; Relevance Vector Machines


Implementing Logical Relations
The AND and OR operations are linearly separable problems.

The XOR Problem
XOR is not linearly separable.

x1  x2  XOR  Class
0   0   0    B
0   1   1    A
1   0   1    A
1   1   0    B

How can we use linear classifiers to solve this problem?

Combining two linear classifiers
Idea: use a logical combination of two linear classifiers:
$g_1(x) = x_1 + x_2 - \tfrac{1}{2}$
$g_2(x) = x_1 + x_2 - \tfrac{3}{2}$

Combining two linear classifiers
Let $f(x)$ be the unit step activation function: $f(x) = 0$ for $x < 0$ and $f(x) = 1$ for $x \ge 0$.
Observe that the classification problem is then solved by
$f\left(y_1 - y_2 - \tfrac{1}{2}\right)$, where $y_1 = f(g_1(x))$ and $y_2 = f(g_2(x))$,
with $g_1(x) = x_1 + x_2 - \tfrac{1}{2}$ and $g_2(x) = x_1 + x_2 - \tfrac{3}{2}$.

Combining two linear classifiers
This calculation can be implemented sequentially:
1. Compute $y_1$ and $y_2$ from $x_1$ and $x_2$.
2. Compute the decision from $y_1$ and $y_2$.
Each layer in the sequence consists of one or more linear classifications. This is therefore a two-layer perceptron: it computes $f\left(y_1 - y_2 - \tfrac{1}{2}\right)$, where $y_1 = f(g_1(x))$, $y_2 = f(g_2(x))$, $g_1(x) = x_1 + x_2 - \tfrac{1}{2}$, and $g_2(x) = x_1 + x_2 - \tfrac{3}{2}$.

The Two-Layer Perceptron

x1  x2 | y1 (Layer 1)  y2 (Layer 1) | Output (Layer 2)
0   0  | 0 (-)         0 (-)        | B (0)
0   1  | 1 (+)         0 (-)        | A (1)
1   0  | 1 (+)         0 (-)        | A (1)
1   1  | 1 (+)         1 (+)        | B (0)

where the output is $f\left(y_1 - y_2 - \tfrac{1}{2}\right)$, with $y_1 = f(g_1(x))$, $y_2 = f(g_2(x))$, $g_1(x) = x_1 + x_2 - \tfrac{1}{2}$, and $g_2(x) = x_1 + x_2 - \tfrac{3}{2}$.
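To make the construction concrete, here is a minimal Python sketch of this two-layer perceptron; the weights are exactly the $g_1$, $g_2$, and output-unit coefficients from the slides, and $f$ is the unit step defined above.

```python
import numpy as np

def f(a):
    """Unit step activation: f(a) = 1 if a >= 0, else 0."""
    return (a >= 0).astype(float)

def two_layer_xor(x):
    """Two-layer perceptron for XOR built from g1, g2 in the slides.

    Hidden layer: y1 = f(x1 + x2 - 1/2), y2 = f(x1 + x2 - 3/2)
    Output layer: f(y1 - y2 - 1/2) -> 1 for class A (XOR true), 0 for class B
    """
    x = np.asarray(x, dtype=float)
    y1 = f(x[0] + x[1] - 0.5)
    y2 = f(x[0] + x[1] - 1.5)
    return f(y1 - y2 - 0.5)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, int(two_layer_xor(x)))   # prints 0, 1, 1, 0, matching the table above
```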

The Two-Layer Perceptron
The first layer performs a nonlinear mapping, $y_1 = f(g_1(x))$ and $y_2 = f(g_2(x))$, that makes the data linearly separable.

The Two-Layer Perceptron Architecture
[Figure: input layer $(x_1, x_2)$ with bias input $-1$, hidden layer computing $g_1(x) = x_1 + x_2 - \tfrac{1}{2}$ and $g_2(x) = x_1 + x_2 - \tfrac{3}{2}$, and output layer computing $y_1 - y_2 - \tfrac{1}{2}$.]

The Two-Layer Perceptron
Note that the hidden layer, $y_1 = f(g_1(x))$ and $y_2 = f(g_2(x))$, maps the plane onto the vertices of a unit square.

Higher Dimensions
Each hidden unit realizes a hyperplane discriminant function. The output of each hidden unit is 0 or 1, depending upon the location of the input vector relative to the hyperplane:
$x \in \mathbb{R}^l \;\rightarrow\; y = [y_1, \ldots, y_p]^T, \quad y_i \in \{0, 1\},\ i = 1, \ldots, p$

Higher Dimensions
Together, the hidden units map the input onto the vertices of a p-dimensional unit hypercube:
$x \in \mathbb{R}^l \;\rightarrow\; y = [y_1, \ldots, y_p]^T, \quad y_i \in \{0, 1\},\ i = 1, \ldots, p$

Two-Layer Perceptron
These p hyperplanes partition the l-dimensional input space into polyhedral regions. Each region corresponds to a different vertex of the p-dimensional hypercube represented by the outputs of the hidden layer.

Two-Layer Perceptron
In this example, the vertex $(0, 0, 1)$ corresponds to the region of the input space where $g_1(x) < 0$, $g_2(x) < 0$, and $g_3(x) > 0$.

Limitations of a Two-Layer Perceptron
The output neuron realizes a hyperplane in the transformed space that partitions the p vertices into two sets. Thus the two-layer perceptron can classify vectors into classes that consist of unions of polyhedral regions, but not any union: it depends on the relative position of the corresponding vertices. How can we solve this problem?

The Three-Layer Perceptron
Suppose that Class A consists of the union of K polyhedra in the input space. Use K neurons in the 2nd hidden layer. Train each to classify one region as positive, the rest negative. Now use an output neuron that implements the OR function.

The Three-Layer Perceptron
Thus the three-layer perceptron can separate classes resulting from any union of polyhedral regions in the input space.

The Three-Layer Perceptron
The first layer of the network forms the hyperplanes in the input space. The second layer of the network forms the polyhedral regions of the input space. The third layer forms the appropriate unions of these regions and maps each to the appropriate class.

Learning Parameters

Training Data
The training data consist of N input-output pairs $(y(i), x(i)),\ i = 1, \ldots, N$, where
$y(i) = [y_1(i), \ldots, y_{k_L}(i)]^t$ and $x(i) = [x_1(i), \ldots, x_{k_0}(i)]^t$.

Choosing an Activation Function
The unit step activation function means that the error rate of the network is a discontinuous function of the weights. This makes it difficult to learn optimal weights by minimizing the error. To fix this problem, we need to use a smooth activation function. A popular choice is the sigmoid function we used for logistic regression:

Smooth Activation Function
$f(a) = \dfrac{1}{1 + \exp(-a)}$
[Figure: the sigmoid as a function of the activation $a = w^t x$.]

Output: Two Classes
For a binary classification problem, there is a single output node with activation function given by
$f(a) = \dfrac{1}{1 + \exp(-a)}$.
Since the output is constrained to lie between 0 and 1, it can be interpreted as the probability of the input vector belonging to Class 1.

Output: K > 2 Classes
For a K-class problem, we use K outputs and the softmax function given by
$y_k = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$.
Since the outputs are constrained to lie between 0 and 1 and sum to 1, $y_k$ can be interpreted as the probability that the input vector belongs to Class k.
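A small sketch of these two output activations in Python; subtracting the maximum inside the softmax is a standard numerical-stability device, not something from the slides.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid f(a) = 1 / (1 + exp(-a)): single output for two classes."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax over K activations: y_k = exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()              # shift for numerical stability; the result is unchanged
    e = np.exp(a)
    return e / e.sum()

print(sigmoid(0.0))              # 0.5
print(softmax([2.0, 1.0, 0.1]))  # three probabilities summing to 1
```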

Non-Convex
Now each layer of our multi-layer perceptron is a logistic regressor. Recall that optimizing the weights in logistic regression results in a convex optimization problem. Unfortunately, the cascading of logistic regressors in the multi-layer perceptron makes the problem non-convex. This makes it difficult to determine an exact solution. Instead, we typically use gradient descent to find a locally optimal solution for the weights. The specific learning algorithm is called the backpropagation algorithm.

End of Lecture Nov 21, 2011

Nonlinear Classification and Regression: Outline
Multi-Layer Perceptrons; The Back-Propagation Learning Algorithm; Generalized Linear Models; Radial Basis Function Networks; Sparse Kernel Machines; Nonlinear SVMs and the Kernel Trick; Relevance Vector Machines

The Backpropagation Algorithm
Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533-536.
[Photos: Werbos, Rumelhart, Hinton]

Notation
Assume a network with L layers: $k_0$ nodes in the input layer and $k_r$ nodes in the r-th layer.

Notation
Let $y_k^{r-1}$ be the output of the k-th neuron of layer $r-1$. Let $w_{jk}^r$ be the weight of the synapse on the j-th neuron of layer r from the k-th neuron of layer $r-1$.

Input
$y_k^0(i) = x_k(i), \quad k = 1, \ldots, k_0$

Notation
Let $v_j^r(i)$ be the total input to the j-th neuron of layer r:
$v_j^r(i) = (w_j^r)^t y^{r-1}(i) = \sum_{k=0}^{k_{r-1}} w_{jk}^r\, y_k^{r-1}(i)$,
where we define $y_0^{r-1}(i) = +1$ to incorporate the bias term. Then
$y_j^r(i) = f\!\left(v_j^r(i)\right) = f\!\left(\sum_{k=0}^{k_{r-1}} w_{jk}^r\, y_k^{r-1}(i)\right)$.

Cost Function
A common cost function is the squared error:
$J = \sum_{i=1}^N \varepsilon(i)$, where
$\varepsilon(i) = \frac{1}{2} \sum_{m=1}^{k_L} e_m(i)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left(\hat{y}_m(i) - y_m(i)\right)^2$
and $\hat{y}_m(i) = y_m^L(i)$ is the output of the network.

Cost Function
To summarize, the error for input i is given by
$\varepsilon(i) = \frac{1}{2} \sum_{m=1}^{k_L} e_m(i)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left(\hat{y}_m(i) - y_m(i)\right)^2$,
where $\hat{y}_m(i) = y_m^L(i)$ is the output of the output layer, and each layer is related to the previous layer through
$y_j^r(i) = f\!\left(v_j^r(i)\right)$ and $v_j^r(i) = (w_j^r)^t y^{r-1}(i)$.

Gradient Descent
Gradient descent starts with an initial guess at the weights over all layers of the network. We then use these weights to compute the network output $\hat{y}(i)$ for each input vector $x(i)$ in the training data. This allows us to calculate the error
$\varepsilon(i) = \frac{1}{2} \sum_{m=1}^{k_L} e_m(i)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left(\hat{y}_m(i) - y_m(i)\right)^2$
for each of these inputs. Then, in order to minimize this error, we incrementally update the weights in the negative gradient direction:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \dfrac{\partial J}{\partial w_j^r} = w_j^r(\text{old}) - \mu \sum_{i=1}^N \dfrac{\partial \varepsilon(i)}{\partial w_j^r}$

Gradient Descent
Since $v_j^r(i) = (w_j^r)^t y^{r-1}(i)$, the influence of the j-th weight vector of the r-th layer on the error can be expressed as:
$\dfrac{\partial \varepsilon(i)}{\partial w_j^r} = \dfrac{\partial \varepsilon(i)}{\partial v_j^r(i)} \dfrac{\partial v_j^r(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$,
where $\delta_j^r(i) \equiv \dfrac{\partial \varepsilon(i)}{\partial v_j^r(i)}$.

Gradient Descent
$\dfrac{\partial \varepsilon(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$, where $\delta_j^r(i) \equiv \dfrac{\partial \varepsilon(i)}{\partial v_j^r(i)}$.
For an intermediate layer r, we cannot compute $\delta_j^r(i)$ directly. However, $\delta_j^r(i)$ can be computed inductively, starting from the output layer.

Backpropagation: The Output Layer
$\dfrac{\partial \varepsilon(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$, where $\delta_j^r(i) \equiv \dfrac{\partial \varepsilon(i)}{\partial v_j^r(i)}$ and $\varepsilon(i) = \frac{1}{2} \sum_{m=1}^{k_L} e_m(i)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left(\hat{y}_m(i) - y_m(i)\right)^2$.
Recall that $\hat{y}_j(i) = y_j^L(i) = f\!\left(v_j^L(i)\right)$. Thus at the output layer we have
$\delta_j^L(i) = \dfrac{\partial \varepsilon(i)}{\partial v_j^L(i)} = \dfrac{\partial \varepsilon(i)}{\partial e_j^L(i)} \dfrac{\partial e_j^L(i)}{\partial v_j^L(i)} = e_j^L(i)\, f'\!\left(v_j^L(i)\right)$.
For the sigmoid $f(a) = \frac{1}{1 + \exp(-a)}$, $f'(a) = f(a)\left(1 - f(a)\right)$, so
$\delta_j^L(i) = e_j^L(i)\, f\!\left(v_j^L(i)\right)\left(1 - f\!\left(v_j^L(i)\right)\right)$.

Backpropagation: Hidden Layers
Observe that the dependence of the error on the total input to a neuron in a previous layer can be expressed in terms of its dependence on the total inputs of neurons in the following layer:
$\delta_j^{r-1}(i) = \dfrac{\partial \varepsilon(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \dfrac{\partial \varepsilon(i)}{\partial v_k^r(i)} \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \delta_k^r(i) \dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$,
where $v_k^r(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r\, y_m^{r-1}(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r\, f\!\left(v_m^{r-1}(i)\right)$.
Thus we have $\dfrac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = w_{kj}^r\, f'\!\left(v_j^{r-1}(i)\right)$, and so
$\delta_j^{r-1}(i) = f'\!\left(v_j^{r-1}(i)\right) \sum_{k=1}^{k_r} \delta_k^r(i)\, w_{kj}^r$,
where for the sigmoid $f'(v) = f(v)\left(1 - f(v)\right)$.
Thus once the $\delta_k^r(i)$ are determined, they can be propagated backward to calculate the $\delta_j^{r-1}(i)$ using this inductive formula.

Backpropagation: Summary of Algorithm
1. Initialization: initialize all weights with small random values.
2. Forward Pass: for each input vector, run the network in the forward direction, calculating
$v_j^r(i) = (w_j^r)^t y^{r-1}(i)$, $y_j^r(i) = f\!\left(v_j^r(i)\right)$, and finally
$\varepsilon(i) = \frac{1}{2} \sum_{m=1}^{k_L} e_m(i)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left(\hat{y}_m(i) - y_m(i)\right)^2$.
3. Backward Pass: starting with the output layer, use our inductive formula to compute the $\delta_j^{r-1}(i)$:
Output layer (base case): $\delta_j^L(i) = e_j^L(i)\, f'\!\left(v_j^L(i)\right)$.
Hidden layers (inductive case): $\delta_j^{r-1}(i) = f'\!\left(v_j^{r-1}(i)\right) \sum_{k=1}^{k_r} \delta_k^r(i)\, w_{kj}^r$.
4. Update Weights:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \sum_{i=1}^N \dfrac{\partial \varepsilon(i)}{\partial w_j^r}$, where $\dfrac{\partial \varepsilon(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$.
Repeat until convergence.
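The following is a compact NumPy sketch of the algorithm just summarized, for sigmoid units and the squared-error cost. The layer sizes, learning rate, epoch count, and the XOR training set are illustrative assumptions, and convergence to a good solution is not guaranteed for every initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights):
    """Forward pass: returns the list of layer outputs y^0, ..., y^L."""
    ys = [np.asarray(x, dtype=float)]
    for W in weights:                               # W has shape (k_r, k_{r-1} + 1); bias weight in column 0
        y_prev = np.concatenate(([1.0], ys[-1]))    # y_0^{r-1}(i) = +1 carries the bias term
        ys.append(sigmoid(W @ y_prev))              # y^r = f(v^r), with v^r = W y^{r-1}
    return ys

def backprop_batch(X, T, weights, mu=1.0, epochs=5000):
    """Batch gradient descent using the backpropagation recursion for the deltas."""
    for _ in range(epochs):
        grads = [np.zeros_like(W) for W in weights]
        for x, t in zip(X, T):
            ys = forward(x, weights)
            # Output layer (base case): delta = e * f'(v); f'(v) = y(1 - y) for the sigmoid.
            delta = (ys[-1] - t) * ys[-1] * (1.0 - ys[-1])
            for r in range(len(weights) - 1, -1, -1):
                grads[r] += np.outer(delta, np.concatenate(([1.0], ys[r])))
                if r > 0:                           # hidden layers (inductive case)
                    delta = (weights[r][:, 1:].T @ delta) * ys[r] * (1.0 - ys[r])
        for r in range(len(weights)):
            weights[r] -= mu * grads[r]             # w^r(new) = w^r(old) - mu * sum_i dE(i)/dw^r
    return weights

# Illustrative use: a 2-3-1 network on the XOR data from earlier in the lecture.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])
weights = [rng.normal(scale=0.5, size=(3, 3)), rng.normal(scale=0.5, size=(1, 4))]
weights = backprop_batch(X, T, weights)
print([round(float(forward(x, weights)[-1][0]), 2) for x in X])   # ideally close to 0, 1, 1, 0
```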

Batch vs Online Learning
As described, on each iteration backprop updates the weights based upon all of the training data:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \sum_{i=1}^N \dfrac{\partial \varepsilon(i)}{\partial w_j^r}$, where $\dfrac{\partial \varepsilon(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$.
This is called batch learning.
An alternative is to update the weights after each training input has been processed by the network, based only upon the error for that input:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \dfrac{\partial \varepsilon(i)}{\partial w_j^r}$.
This is called online learning.

Batch vs Online Learning
One advantage of batch learning is that averaging over all inputs when updating the weights should lead to smoother convergence. On the other hand, the randomness associated with online learning might help to prevent convergence toward a local minimum. Changing the order of presentation of the inputs from epoch to epoch may also improve results.
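As a hedged sketch of the alternative, the batch loop above changes as follows for online learning; it reuses forward() (and the NumPy import) from the batch sketch, and reshuffles the presentation order each epoch as suggested above.

```python
import numpy as np

def backprop_online(X, T, weights, mu=0.5, epochs=2000, seed=0):
    """Online learning: one weight update per training example (reuses forward() from the batch sketch)."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):               # change the presentation order each epoch
            ys = forward(X[i], weights)
            delta = (ys[-1] - T[i]) * ys[-1] * (1.0 - ys[-1])
            for r in range(len(weights) - 1, -1, -1):
                grad = np.outer(delta, np.concatenate(([1.0], ys[r])))
                if r > 0:
                    delta = (weights[r][:, 1:].T @ delta) * ys[r] * (1.0 - ys[r])
                weights[r] -= mu * grad                 # update based only on the error for input i
    return weights
```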

Remarks
Local Minima: The objective function is in general non-convex, and so the solution may not be globally optimal.
Stopping Criterion: Typically stop when the change in weights or the change in the error function falls below a threshold.
Learning Rate: The speed and reliability of convergence depends on the learning rate.

Nonlinear Classification and Regression: Outline
Multi-Layer Perceptrons; The Back-Propagation Learning Algorithm; Generalized Linear Models; Radial Basis Function Networks; Sparse Kernel Machines; Nonlinear SVMs and the Kernel Trick; Relevance Vector Machines

Generalizing Linear Classifiers
One way of tackling problems that are not linearly separable is to transform the input in a nonlinear fashion prior to applying a linear classifier. The result is that decision boundaries that are linear in the resulting feature space may be highly nonlinear in the original input space.
[Figure: the same data shown in the input space $(x_1, x_2)$ and in the feature space $(\phi_1, \phi_2)$.]

Nonlinear Basis Function Models
Generally,
$y(x, w) = \sum_{j=0}^{M-1} w_j\, \phi_j(x)$,
where the $\phi_j(x)$ are known as basis functions. Typically $\phi_0(x) = 1$, so that $w_0$ acts as a bias.

Nonlinear basis functions for classification
In the context of classification, the discriminant function in the feature space becomes:
$g\!\left(y(x)\right) = w_0 + \sum_{i=1}^M w_i\, y_i(x) = w_0 + \sum_{i=1}^M w_i\, \phi_i(x)$
This formulation can be thought of as an input-space approximation of the true separating discriminant function $g(x)$ using a set of interpolation functions $\phi_i(x)$.
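A minimal sketch of this idea: map the inputs through fixed nonlinear basis functions and fit an ordinary linear discriminant on the features. The quadratic basis, least-squares fit, and toy circular-boundary data are illustrative choices, not taken from the slides.

```python
import numpy as np

def quadratic_features(X):
    """phi(x) = [1, x1, x2, x1^2, x1*x2, x2^2] for 2-D inputs; phi_0 = 1 is the bias."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

def fit_linear_discriminant(X, t):
    """Least-squares fit of g(x) = w^t phi(x); the decision is sign(g(x))."""
    Phi = quadratic_features(X)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Toy example: a circular class boundary, linearly separable only in the feature space.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
t = np.where((X ** 2).sum(axis=1) < 0.5, 1.0, -1.0)
w = fit_linear_discriminant(X, t)
pred = np.sign(quadratic_features(X) @ w)
print("training accuracy:", (pred == t).mean())
```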

Dimensionality
The dimensionality M of the feature space may be less than, equal to, or greater than the dimensionality D of the original input space.
M < D: This may result in a factoring out of irrelevant dimensions, a reduction in the number of model parameters, and a resulting improvement in generalization (reduced overlearning).
M > D: Problems that are not linearly separable in the input space may become separable in the feature space, and the probability of linear separability generally increases with the dimensionality of the feature space. Thus choosing M >> D helps to make the problem linearly separable.

Cover's Theorem
"A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated."
Cover, T.M., "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition", 1965.
[Example figure]

Nonlinear Classification and Regression: Outline
Multi-Layer Perceptrons; The Back-Propagation Learning Algorithm; Generalized Linear Models; Radial Basis Function Networks; Sparse Kernel Machines; Nonlinear SVMs and the Kernel Trick; Relevance Vector Machines

Radial Basis Functions
Consider interpolation functions (kernels) of the form $\phi_i\!\left(\lVert x - \mu_i \rVert\right)$. In other words, the feature value depends only upon the Euclidean distance to a centre point in the input space. A commonly used RBF is the isotropic Gaussian:
$\phi_i(x) = \exp\!\left(-\frac{1}{2\sigma_i^2} \lVert x - \mu_i \rVert^2\right)$
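A minimal sketch of the isotropic Gaussian RBF feature, using a single common scale sigma for all centres (the slides allow a separate $\sigma_i$ per centre):

```python
import numpy as np

def gaussian_rbf(X, centres, sigma):
    """phi_i(x_n) = exp(-||x_n - mu_i||^2 / (2 sigma^2)), returned as an N x M feature matrix."""
    X, centres = np.atleast_2d(X), np.atleast_2d(centres)
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)   # ||x_n - mu_i||^2
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Example: three 1-D centres; each column of Phi is one radial basis function.
Phi = gaussian_rbf(np.linspace(-1, 1, 5).reshape(-1, 1),
                   np.array([[-1.0], [0.0], [1.0]]), sigma=0.5)
print(Phi.shape)   # (5, 3)
```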

Relation to KDE
We can use Gaussian RBFs to approximate the discriminant function g(x):
$g\!\left(y(x)\right) = w_0 + \sum_{i=1}^M w_i\, y_i(x) = w_0 + \sum_{i=1}^M w_i\, \phi_i(x)$,
where $\phi_i(x) = \exp\!\left(-\frac{1}{2\sigma_i^2} \lVert x - \mu_i \rVert^2\right)$.
This is reminiscent of kernel density estimation, where we approximated probability densities as a normalized sum of Gaussian kernels.

Relation to KDE
For KDE we planted a kernel at each data point. Thus there were N kernels. For RBF networks, we generally use far fewer kernels than the number of data points: M << N. This leads to greater efficiency and generalization.

RBF Networks
The linear classifier with nonlinear radial basis functions can be considered an artificial neural network where the hidden nodes are nonlinear (e.g., Gaussian) and the output node is linear.
[Figure: RBF network for 2 classes.]

RBF Networks vs Perceptrons
Recall that for a perceptron, the output of a hidden unit is invariant on a hyperplane. For an RBF, the output of a hidden unit is invariant on a circle centred on $\mu_i$. Thus hidden units are global in a perceptron, but local in an RBF network.
[Figure: RBF network for 2 classes.]

RBF Networks vs Perceptrons
This difference has consequences: Multilayer perceptrons tend to learn slower than RBFs. However, multilayer perceptrons tend to have better generalization properties, especially in regions of the input space where training data are sparse. Typically, more neurons are needed for an RBF than for a multilayer perceptron to solve a given problem.

Parameters
There are two options for choosing the parameters (centres and scales) of the RBFs:
1. Fixed. For example, randomly select a subset of M of the input vectors and use these as centres. Use a common scale based upon your judgement.
2. Learned. Note that when the RBF parameters are fixed, the weights could be learned using linear classifier techniques (e.g., least squares). Thus the RBF parameters could be learned in an outer loop, by gradient descent.
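A hedged sketch of option 1: fix the centres as a random subset of M input vectors, fix a common scale, and fit the output weights by least squares. It reuses gaussian_rbf from the earlier sketch; the parameter values are illustrative and it assumes len(X) >= M.

```python
import numpy as np

def fit_rbf_network(X, t, M=20, sigma=0.5, seed=0):
    """Option 1: fixed centres (random subset of M inputs), common scale, linear output weights."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=M, replace=False)]     # requires len(X) >= M
    Phi = np.column_stack([np.ones(len(X)), gaussian_rbf(X, centres, sigma)])   # hidden layer + bias
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)                # linear output node via least squares
    return centres, w

def predict_rbf(Xnew, centres, w, sigma=0.5):
    """Evaluate g(x) = w_0 + sum_i w_i phi_i(x) at new inputs."""
    Phi = np.column_stack([np.ones(len(Xnew)), gaussian_rbf(Xnew, centres, sigma)])
    return Phi @ w
```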

Nonlinear Classification and Regression: Outline
Multi-Layer Perceptrons; The Back-Propagation Learning Algorithm; Generalized Linear Models; Radial Basis Function Networks; Sparse Kernel Machines; Nonlinear SVMs and the Kernel Trick; Relevance Vector Machines

The Kernel Function
Recall that an SVM is the solution to the problem:
Maximize $\tilde{L}(a) = \sum_{n=1}^N a_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m\, k(x_n, x_m)$
subject to $0 \le a_n \le C$ and $\sum_{n=1}^N a_n t_n = 0$.
A new input x is classified by computing
$y(x) = \sum_{n \in S} a_n t_n\, k(x, x_n) + b$,
where S is the set of support vectors. Here we introduced the kernel function $k(x, x')$, defined as
$k(x, x') = \phi(x)^t \phi(x')$.
This is more than a notational convenience!

The Kernel Trick
Maximize $\tilde{L}(a) = \sum_{n=1}^N a_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m\, k(x_n, x_m)$
subject to $0 \le a_n \le C$ and $\sum_{n=1}^N a_n t_n = 0$, where $k(x, x') = \phi(x)^t \phi(x')$.
Note that the basis functions and individual training vectors are no longer part of the objective function. Instead, all we need is the kernel value (like a distance measure) for all pairs of training vectors.
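A small sketch of the point being made: the dual objective needs only the Gram matrix $K_{nm} = k(x_n, x_m)$, and the decision function needs only kernel values against the support vectors. The dual coefficients a and bias b are assumed to come from a QP solver, which is not shown here.

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[n, m] = k(x_n, x_m): the only thing the dual objective needs from the data."""
    N = len(X)
    return np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])

def svm_decision(x, support_X, support_t, a, b, kernel):
    """y(x) = sum_{n in S} a_n t_n k(x, x_n) + b, summing over the support vectors only."""
    return sum(a_n * t_n * kernel(x, x_n)
               for a_n, t_n, x_n in zip(a, support_t, support_X)) + b

# Example kernel (Example 3 on a later slide): k(x, z) = (x^t z)^2
poly2 = lambda x, z: float(x @ z) ** 2
```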

The Kernel Function
The kernel function $k(x, x')$ measures the 'similarity' of input vectors x and x' as an inner product in a feature space defined by the feature space mapping $\phi(x)$:
$k(x, x') = \phi(x)^t \phi(x')$
If $k(x, x') = k(x - x')$ we say that the kernel is stationary. If $k(x, x') = k\!\left(\lVert x - x' \rVert\right)$ we call it a radial basis function.

Constructing Kernels
We can construct a kernel by selecting a feature space mapping $\phi(x)$ and then defining
$k(x, x') = \phi(x)^t \phi(x')$.
[Figure: Gaussian basis functions $\phi(x')$ and the corresponding kernel $k(x, x')$.]

Constructing Kernels
Alternatively, we can construct the kernel function directly, ensuring that it corresponds to an inner product in some (possibly infinite-dimensional) feature space.

Constructing Kernels
$k(x, x') = \phi(x)^t \phi(x')$
Example 1: $k(x, z) = x^t z$
Example 2: $k(x, z) = x^t z + c,\ c > 0$
Example 3: $k(x, z) = \left(x^t z\right)^2$
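For Example 3 with 2-D inputs, a quick numerical check that $k(x, z) = (x^t z)^2$ equals an explicit inner product $\phi(x)^t \phi(z)$, using the (assumed) feature map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel k(x, z) = (x^t z)^2 with 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2, phi(x) @ phi(z))   # both equal 1.0
```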

Kernel Properties
Kernels obey certain properties that make it easy to construct complex kernels from simpler ones.

Kernel Properties
Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following kernels will also be valid:
$k(x, x') = c\, k_1(x, x')$ (6.13)
$k(x, x') = f(x)\, k_1(x, x')\, f(x')$ (6.14)
$k(x, x') = q\!\left(k_1(x, x')\right)$ (6.15)
$k(x, x') = \exp\!\left(k_1(x, x')\right)$ (6.16)
$k(x, x') = k_1(x, x') + k_2(x, x')$ (6.17)
$k(x, x') = k_1(x, x')\, k_2(x, x')$ (6.18)
$k(x, x') = k_3\!\left(\phi(x), \phi(x')\right)$ (6.19)
$k(x, x') = x^T A x'$ (6.20)
$k(x, x') = k_a(x_a, x_a') + k_b(x_b, x_b')$ (6.21)
$k(x, x') = k_a(x_a, x_a')\, k_b(x_b, x_b')$ (6.22)
where $c > 0$, $f(\cdot)$ is any function, $q(\cdot)$ is a polynomial with nonnegative coefficients, $\phi(x)$ is a mapping from $x$ to $\mathbb{R}^M$, $k_3$ is a valid kernel on $\mathbb{R}^M$, $A$ is a symmetric positive semidefinite matrix, $x_a$ and $x_b$ are variables such that $x = (x_a, x_b)$, and $k_a$, $k_b$ are valid kernels over their respective spaces.

Constructing Kernels
Examples:
$k(x, x') = \left(x^t x' + c\right)^M,\ c > 0$ (use 6.18)
$k(x, x') = \exp\!\left(-\lVert x - x' \rVert^2 / 2\sigma^2\right)$ (use 6.14 and 6.16); corresponds to an infinite-dimensional feature vector.
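Minimal implementations of the two example kernels; the default parameter values are arbitrary illustrative choices.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, M=3):
    """k(x, z) = (x^t z + c)^M, c > 0: the first example above."""
    return (float(x @ z) + c) ** M

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)): the second example above."""
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2)))
```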

Nonlinear SVM Example (Gaussian Kernel)
[Figure: example in the input space $(x_1, x_2)$.]

SVMs for Regression
In standard linear regression, we minimize
$\frac{1}{2} \sum_{n=1}^N \left(y_n - t_n\right)^2 + \frac{\lambda}{2} \lVert w \rVert^2$.
This penalizes all deviations from the model. To obtain sparse solutions, we replace the quadratic error function by an $\epsilon$-insensitive error function, e.g.,
$E_\epsilon\!\left(y(x) - t\right) = \begin{cases} 0, & \text{if } |y(x) - t| < \epsilon \\ |y(x) - t| - \epsilon, & \text{otherwise} \end{cases}$
See text for details of solution.
[Figure: the $\epsilon$-tube ($y - \epsilon$, $y$, $y + \epsilon$) around $y(x)$, with points outside the tube having slack $\xi > 0$.]
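A small sketch of the $\epsilon$-insensitive error function, vectorized over residuals $y(x) - t$:

```python
import numpy as np

def eps_insensitive(residual, eps=0.1):
    """E_eps(r) = 0 if |r| < eps, else |r| - eps: deviations inside the eps-tube are not penalized."""
    return np.maximum(np.abs(residual) - eps, 0.0)

print(eps_insensitive(np.array([-0.05, 0.08, 0.3])))   # [0.  0.  0.2]
```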

Example
[Figure: SVM regression example, t vs. x.]

Nonlinear Classification and Regression: Outline
Multi-Layer Perceptrons; The Back-Propagation Learning Algorithm; Generalized Linear Models; Radial Basis Function Networks; Sparse Kernel Machines; Nonlinear SVMs and the Kernel Trick; Relevance Vector Machines

Relevance Vector Machines
Some drawbacks of SVMs:
They do not provide posterior probabilities.
They are not easily generalized to K > 2 classes.
Parameters (C, $\epsilon$) must be learned by cross-validation.
The Relevance Vector Machine is a sparse Bayesian kernel technique that avoids these drawbacks. RVMs also typically lead to sparser models.

RVMs for Regression
$p(t \mid x, w, \beta) = \mathcal{N}\!\left(t \mid y(x), \beta^{-1}\right)$, where $y(x) = w^t \phi(x)$.
In an RVM, the basis functions $\phi(x)$ are kernels $k(x, x_n)$:
$y(x) = \sum_{n=1}^N w_n\, k(x, x_n) + b$.
However, unlike in SVMs, the kernels need not be positive definite, and the $x_n$ need not be the training data points.

RVMs for Regression
Likelihood: $p(t \mid X, w, \beta) = \prod_{n=1}^N p(t_n \mid x_n, w, \beta)$, where the n-th row of X is $x_n^t$.
Prior: $p(w \mid \alpha) = \prod_{i=1}^M \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right)$.
Note that each weight parameter has its own precision hyperparameter.

RVMs for Regression
Gaussian prior: $p(w_i \mid \alpha_i) = \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right)$, with $p(\alpha_i) = \text{Gam}(\alpha_i \mid a, b)$.
Marginal prior: $p(w_i) = \text{St}(w_i \mid 2a)$.
[Figure: marginal priors over $(w_1, w_2)$ for a single shared $\alpha$ versus independent $\alpha_i$.]
The conjugate prior for the precision of a Gaussian is a gamma distribution. Integrating out the precision parameter leads to a Student's t distribution over $w_i$. Thus the distribution over w is a product of Student's t distributions. As a result, maximizing the evidence will yield a sparse w. Note that to achieve sparsity it is critical that each parameter $w_i$ has a separate precision $\alpha_i$.

RVMs for Regression
$p(w_i \mid \alpha_i) = \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right)$ (Gaussian prior), $p(\alpha_i) = \text{Gam}(\alpha_i \mid a, b)$, marginal $p(w_i) = \text{St}(w_i \mid 2a)$.
Gamma distribution: $p(x \mid a, b) = \dfrac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}$.
Student's t distribution: $p(x \mid \nu) = \dfrac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\nu \pi}} \left(1 + \dfrac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$.
Also recall the rule for transforming densities: if y is a monotonic function of x, then $p_Y(y) = p_X(x)\left|\dfrac{dx}{dy}\right|$.
Thus if we let $a \to 0,\ b \to 0$, then $p(\log \alpha_i) \to$ uniform and $p(w_i) \propto |w_i|^{-1}$. Very sparse!

RVMs for Regression
Likelihood: $p(t \mid X, w, \beta) = \prod_{n=1}^N p(t_n \mid x_n, w, \beta)$, where the n-th row of X is $x_n^t$.
Prior: $p(w \mid \alpha) = \prod_{i=1}^M \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right)$.
In practice, it is difficult to integrate out $\alpha$ exactly. Instead, we use an approximate maximum likelihood method, finding ML values for each $\alpha_i$. When we maximize the evidence with respect to these hyperparameters, many will $\to \infty$. As a result, the corresponding weights will $\to 0$, yielding a sparse solution.

RVMs for Regression
Since both the likelihood and prior are normal, the posterior over w will also be normal:
Posterior: $p(w \mid t, X, \alpha, \beta) = \mathcal{N}(w \mid m, \Sigma)$,
where $m = \beta \Sigma \Phi^t t$, $\Sigma = \left(A + \beta \Phi^t \Phi\right)^{-1}$, $A = \text{diag}(\alpha_i)$, and $\Phi_{ni} = \phi_i(x_n)$.
Note that when $\alpha_i \to \infty$, the i-th row and column of $\Sigma \to 0$, and $p(w_i \mid t, X, \alpha, \beta) = \mathcal{N}(w_i \mid 0, 0)$.
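A hedged sketch of computing this posterior for given hyperparameters; the design matrix Phi (with kernel columns $\phi_i(x_n)$) and the data are assumed to be supplied, and the evidence-maximization loop that re-estimates $\alpha$ and $\beta$ is not shown.

```python
import numpy as np

def rvm_posterior(Phi, t, alpha, beta):
    """Posterior over the weights: p(w | t, X, alpha, beta) = N(w | m, Sigma),
    with Sigma = (A + beta Phi^t Phi)^{-1} and m = beta Sigma Phi^t t."""
    A = np.diag(alpha)                              # A = diag(alpha_i)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma

# As alpha_i -> infinity, row/column i of Sigma -> 0 and m_i -> 0: that weight is pruned.
```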

RVMs for Regression
The values for $\alpha$ and $\beta$ are determined using the evidence approximation, where we maximize
$p(t \mid X, \alpha, \beta) = \int p(t \mid X, w, \beta)\, p(w \mid \alpha)\, dw$.
In general, this results in many of the precision parameters $\alpha_i \to \infty$, so that $w_i \to 0$. Unfortunately, this is a non-convex problem.

Example
[Figure: RVM regression example, t vs. x.]