Linear Learning Machines


Linear Learning Machines Chapter 2 February 13, 2003 T.P. Runarsson (tpr@hi.is) and S. Sigurdsson (sven@hi.is)

Linear learning machines

In supervised learning, the learning machine is given a training set of examples (inputs) with associated labels (output values),

  S = ( (x_1, y_1), (x_2, y_2), ..., (x_l, y_l) ) ⊆ (X × Y)^l

where l denotes the number of training samples, the x_i are examples or instances, and the y_i are their labels. A training set S is said to be trivial if all labels are equal. Usually the input space is a subset of real vector space, X ⊆ R^n (n is the dimension of the input space). The input x = (x_1, x_2, ..., x_n)' is a vector of length n (' denotes matrix transposition). Linear functions are probably the best understood and simplest hypotheses. A learning machine using a hypothesis that forms linear combinations of the input variables is known as a linear learning machine.

Linear classification

A linear function f(x) is frequently used for binary classification, y ∈ {-1, +1}, as follows: assign x = (x_1, x_2, ..., x_n)' to the positive (+ve) class if f(x) ≥ 0, otherwise assign it to the negative (-ve) class, where

  f(x) = ∑_{i=1}^n w_i x_i + b = ⟨w, x⟩ + b

(⟨·,·⟩ denotes the inner product).

[Figure: a separating hyperplane (w, b) ∈ R^n × R for a 2D training set.]

Linear classification - geometric interpretation

[Figure: the separating hyperplane (dark line), its normal vector w, the offset b/||w|| and a point x_i.]

The vector w defines a direction perpendicular to the hyperplane (the dark line). The value of b moves the hyperplane parallel to itself; this value is sometimes called the bias (or threshold) and is necessary if all hyperplanes in R^n are to be representable (the number of free parameters is then n + 1). Recall that the perpendicular Euclidean distance from a point x_i to the hyperplane is

  |⟨w, x_i⟩ + b| / ||w|| = |∑_{j=1}^n w_j x_{ij} + b| / ||w||,

where ||w|| = sqrt(⟨w, w⟩).
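
As a quick numerical check of this distance formula, the following MATLAB fragment evaluates it for a few points (a minimal sketch; the variable names w, b and Xpts, and their values, are illustrative and not from the text):

  % perpendicular distance of each row of Xpts to the hyperplane <w,x> + b = 0
  w = [1; 2];                       % hypothetical normal vector
  b = -1;                           % hypothetical bias
  Xpts = [0 0; 1 1; 2 0];           % three example points, one per row
  dist = abs(Xpts*w + b)/norm(w)    % |<w,x_i> + b| / ||w|| for each point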

Rosenblatt's Perceptron

Both statisticians and neural network researchers have used linear classifiers: the theory of linear discriminants by Fisher in 1936, and then perceptrons by Rosenblatt in 1956. Rosenblatt's algorithm was the first iterative procedure for learning a linear classification. The algorithm is: on-line, it starts with an initial connection weight vector w = 0 (the all-zero vector); mistake driven, it only adapts the weights when a classification mistake is made. The algorithm converges when all the training data are correctly classified (this requires the data to be linearly separable), and it does so in finite time.

The neuron and perceptron analogy

[Figure: a biological neuron (dendrites, cell body, nucleus, axon, synapses to other neurons) drawn alongside the perceptron: input nodes x_1, ..., x_n, connection weights w_1, ..., w_n, and an output node computing f(⟨w, x⟩ + b).]

(The threshold of the output node would be set to 0; an extra input with a constant value of 1 is added and the corresponding weight is the bias b.)

Linear separability

In the 1960s Rosenblatt's work (plus some hype) spurred a huge growth of research, and corresponding financial investment, in this area. In 1969, however, Minsky and Papert published a book titled Perceptrons (they were working in symbol-processing AI, a competitor to the perceptron approach). The book presented a condemning look at perceptrons and as a result funding was blocked for more than 10 years! The field was revived by Hopfield in 1982 and Rumelhart & McClelland in 1986.

[Figure: the AND problem is linearly separable, the XOR problem is not.]

The perceptron algorithm (primal form)

Given a linearly separable training set S and learning rate η ∈ R+ (y ∈ {-1, +1}):

  w_0 = 0, b_0 = 0, k = 0, R = max_{1≤i≤l} ||x_i||
  repeat
    for i = 1 to l
      if y_i ( ⟨w_k, x_i⟩ + b_k ) ≤ 0 then
        w_{k+1} ← w_k + η y_i x_i
        b_{k+1} ← b_k + η y_i R²
        k ← k + 1
      end if
    end for
  until no mistake is made within the for loop
  return (w_k, b_k), where k is the number of mistakes.

Note that the contribution of example i to the weight change is α_i η y_i x_i, where α_i is the number of times x_i has been misclassified (i.e. k = ∑_{i=1}^l α_i). Therefore we may write

  w = ∑_{i=1}^l α_i y_i x_i

assuming the initial weights are zero. The learning rate η only changes the scaling of the hyperplane and so is not really needed.
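
The pseudo-code above translates almost line by line into MATLAB; the sketch below is illustrative only (the function name perceptron_primal and its interface are not from the text), and it assumes the data are linearly separable so that the outer loop terminates:

  function [w, b, k] = perceptron_primal(X, y, eta)
  % Perceptron, primal form: X has one example per row, y is a column of +/-1 labels.
  [l, n] = size(X);
  w = zeros(n, 1); b = 0; k = 0;
  R = max(sqrt(sum(X.^2, 2)));           % R = max_i ||x_i||
  mistake = true;
  while mistake                          % repeat ...
      mistake = false;
      for i = 1:l
          if y(i)*(X(i,:)*w + b) <= 0    % classification mistake on example i
              w = w + eta*y(i)*X(i,:)';  % w <- w + eta*y_i*x_i
              b = b + eta*y(i)*R^2;      % b <- b + eta*y_i*R^2
              k = k + 1;                 % count the mistake
              mistake = true;
          end
      end
  end                                    % ... until no mistake is made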

Perceptron (primal) numerical example

The following are the first few steps of the perceptron algorithm (primal form) for the OR problem:

  sample_id  x(1)  x(2)   y
      1       0     0    -1
      2       1     0     1
      3       0     1     1
      4       1     1     1

We start with w^(0) = 0, b^(0) = 0, k = 0 and examine each example in turn. For the first sample x = (x(1), x(2)) = (0, 0) and y = -1:

  f(x) = w^(0)(1)·x(1) + w^(0)(2)·x(2) + b^(0) = 0·0 + 0·0 + 0 = 0

but it should be -ve because y = -1 for x = (0, 0), and so we must do an update:

  w^(k+1)(1) = w^(k)(1) + η·y·x(1) = 0 + 1·(-1)·0 = 0

(here we have chosen η = 1). Similarly:

  w^(k+1)(2) = w^(k)(2) + η·y·x(2) = 0 + 1·(-1)·0 = 0

and finally we update the bias:

  b^(k+1) = b^(k) + η·y·R² = 0 + 1·(-1)·(√2)² = -2

where R is the largest norm of the input vectors: ||x_1|| = 0, ||x_2|| = 1, ||x_3|| = 1, ||x_4|| = √2, that is R = √2. Finally we update our counter, k = k + 1.

Now we still have w^(1) = 0 but b^(1) = -2. Let us examine the second sample x = (x(1), x(2)) = (1, 0) and y = 1:

  f(x) = w^(1)(1)·x(1) + w^(1)(2)·x(2) + b^(1) = 0·1 + 0·0 - 2 = -2

but it should be +ve because y = 1 for x = (1, 0), and so we must do an update:

  w^(k+1)(1) = w^(k)(1) + η·y·x(1) = 0 + 1·1·1 = 1

and:

  w^(k+1)(2) = w^(k)(2) + η·y·x(2) = 0 + 1·1·0 = 0

and finally the bias:

  b^(k+1) = b^(k) + η·y·R² = -2 + 1·1·2 = 0.

We update our counter k = k + 1 again, and continue looping through the examples until no more classification mistakes are made.
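
Assuming the perceptron_primal sketch given earlier, the OR example can be run to completion as follows (illustrative only):

  X = [0 0; 1 0; 0 1; 1 1];               % OR inputs, one example per row
  y = [-1; 1; 1; 1];                      % OR labels
  [w, b, k] = perceptron_primal(X, y, 1)  % eta = 1, as in the hand calculation above
  % w and b now separate the OR data; k is the total number of mistakes made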

Multi-class perceptron

When we have m different classes we simply create m perceptrons! For example, say there are m = 3 classes and two inputs (n = 2), and we are given the following training set:

  S = ( ([0, 0], ·), ([1, 0], ·), ([0, 1], ·), ([1, 1], ·) )

where each of the four inputs is labelled with one of three classes, say c_1, c_2 and c_3. The target outputs for the three different perceptrons are defined accordingly: the first perceptron (w_1, b_1) is used to distinguish all inputs belonging to class c_1, so its target output is +1 for the examples of class c_1 and -1 for all the others; the second (w_2, b_2) does the same for class c_2, and the last (w_3, b_3) for class c_3. That is, either the input belongs to a particular class (+1) or not (-1). The perceptron algorithm is the same as before, but now we simply have three binary problems to solve!

The results for the above example could look something like this:

[Figure: the three perceptron decision boundaries plotted in the (x_1, x_2) plane, with hatched regions where the binary decisions overlap or leave points unassigned.]

In the outer hatched regions we have points belonging to two classes! In the inner hatched region the points are in no class. In general we may resolve the hatched regions as follows:

  c(x) = argmax_{1≤i≤m} ( ⟨w_i, x⟩ + b_i )

i.e. assigning x to the class whose hyperplane is furthest from it.
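
The decision rule c(x) = argmax_i (⟨w_i, x⟩ + b_i) is a one-liner once the m perceptrons are trained; in the sketch below the weight vectors are assumed to be stored as the columns of an n × m matrix W and the biases in an m-vector bvec (names chosen here for illustration):

  scores = W'*x + bvec;      % <w_i, x> + b_i for each of the m classes, x an n-vector
  [smax, c] = max(scores);   % c is the class whose hyperplane is furthest from x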

The perceptron algorithm (dual form)

Given a linearly separable training set S (y ∈ {-1, +1}):

  R = max_{1≤i≤l} ||x_i||, α = 0, b = 0
  repeat
    for i = 1 to l
      if y_i ( ∑_{j=1}^l α_j y_j ⟨x_j, x_i⟩ + b ) ≤ 0 then
        α_i ← α_i + 1
        b ← b + η y_i R²
      end if
    end for
  until no mistake is made within the for loop
  return (α, b) to define h(x):

  h(x) = sgn( ⟨w, x⟩ + b ) = sgn( ∑_{j=1}^l α_j y_j ⟨x_j, x⟩ + b )

The parameter α_i is referred to as the embedding strength of example i: an example with few/many mistakes has a small/large α_i; for non-separable data the α_i of misclassified points keep on growing; α_i can be regarded as the information content of x_i.
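
A corresponding MATLAB sketch of the dual form (again with one example per row of X and y a column of +/-1 labels; the function name and interface are illustrative), which touches the data only through the Gram matrix:

  function [alpha, b] = perceptron_dual(X, y, eta)
  % Perceptron, dual form: only the inner products <x_j, x_i> are used.
  l = size(X, 1);
  G = X*X';                                  % Gram matrix, G(i,j) = <x_i, x_j>
  R = max(sqrt(sum(X.^2, 2)));
  alpha = zeros(l, 1); b = 0;
  mistake = true;
  while mistake
      mistake = false;
      for i = 1:l
          if y(i)*(G(i,:)*(alpha.*y) + b) <= 0   % sum_j alpha_j*y_j*<x_j,x_i> + b
              alpha(i) = alpha(i) + 1;
              b = b + eta*y(i)*R^2;
              mistake = true;
          end
      end
  end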

The Gram matrix

Given a set {x_1, ..., x_l} of vectors from an inner product space X, the l × l matrix G with entries G_ij = ⟨x_i, x_j⟩ is called the Gram matrix. The book sometimes uses the following notation:

  G = ( ⟨x_i, x_j⟩ )_{i,j=1}^l

An important observation here is that the input data enter the algorithm only through the entries of the Gram matrix!

Margins of a hyperplane

The (functional) margin of an example (x_i, y_i) with respect to a hyperplane (w, b) is the quantity

  γ_i = y_i ( ⟨w, x_i⟩ + b )

(y ∈ {-1, +1}). γ_i > 0 implies a correct classification of (x_i, y_i). The margin distribution of a hyperplane (w, b) w.r.t. a training set S is the distribution of the margins of the examples in S. The minimum of the margin distribution is the (functional) margin of a hyperplane (w, b) with respect to a training set S. The geometric margin is the perpendicular Euclidean distance of the point to the hyperplane (see the geometric interpretation above), i.e. γ_i / ||w||. The margin of a training set S is the maximum geometric margin over all hyperplanes. A hyperplane realizing this maximum is known as a maximal margin hyperplane.
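
For a given hyperplane these quantities are easy to compute; a minimal sketch (variable names illustrative; X with one example per row, y a column of +/-1 labels):

  gamma_f  = y .* (X*w + b);     % functional margins gamma_i = y_i*(<w,x_i> + b)
  gamma_g  = gamma_f / norm(w);  % geometric margins (perpendicular distances)
  margin_S = min(gamma_f)        % (functional) margin of (w,b) w.r.t. the training set S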

The margin and maximal margin

The main points: the margin of S w.r.t. a hyperplane (w, b) is the smallest γ_i / ||w|| over the examples in S. Now try to find some other hyperplane (w_opt, b_opt) where this margin is largest. This will be the maximal geometric margin, and the corresponding hyperplane is the maximal margin hyperplane.

[Figure: a maximal margin hyperplane with geometric margin γ to the closest examples on either side.]

Convergence of the Perceptron in finite time

Theorem [Novikoff]. Let S be a non-trivial training set and let R = max_{1≤i≤l} ||x_i||. Suppose that there exists a vector w_opt such that ||w_opt|| = 1 and

  y_i ( ⟨w_opt, x_i⟩ + b_opt ) ≥ γ   for 1 ≤ i ≤ l.

Then the number of mistakes made by the on-line perceptron algorithm on S is at most (2R/γ)².

Note that for the dual form we can now bound ||α||_1 by the number of mistakes made, i.e.

  ||α||_1 ≤ (2R/γ)²   where ||α||_1 = ∑_{i=1}^l α_i.

The slack margin variable

The slack margin variable measures the amount of non-separability of the sample.

[Figure: a hyperplane with target margin γ; the points x_i and x_j fall short of the margin or lie on the wrong side, with slack variables ξ_i and ξ_j measuring their margin deficits.]

Convergence and non-separable data

Formally, the new quantity, the margin slack variable of an example (x_i, y_i) with respect to the hyperplane (w, b) and target margin γ, is defined as

  ξ_i = max( 0, γ - y_i ( ⟨w, x_i⟩ + b ) ).

If ξ_i > γ, then x_i is misclassified by (w, b). The norm D = ||ξ|| takes into account any misclassification of the training data. The number of mistakes in the first execution of the for loop of the perceptron algorithm on S is bounded by

  ( 2(R + D) / γ )²

(see the Freund and Schapire theorem).
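
The margin slack variables and the quantity D are equally direct to compute for a given hyperplane and target margin; a minimal sketch (names and the value of gamma are illustrative):

  gamma = 1;                            % target margin
  xi = max(0, gamma - y.*(X*w + b));    % margin slack variables xi_i
  D  = norm(xi)                         % D = ||xi||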

Linear regression

In linear regression we associate each data point x ∈ R^n with a distinct output value y ∈ R and aim to construct a linear function f : R^n → R,

  f(x) = ⟨w, x⟩ + b

where w ∈ R^n and b ∈ R, such that f(x) ≈ y for each data vector. In particular, in the training set S = ( (x_1, y_1), ..., (x_l, y_l) ) all the y-values may be different, and we determine from this set w and b in such a way that

  L(w, b) = ∑_{i=1}^l ( ⟨w, x_i⟩ + b - y_i )²

is as small as possible. Introduce an l × n matrix X with l rows and n columns so that the i-th row of X is the i-th data vector x_i'. Denote the j-th column of X by x^j. This is the j-th attribute vector, containing the j-th attribute value of all the l data vectors. Thus

  X = [ x_1' ; x_2' ; ... ; x_l' ] = [ x^1  x^2  ...  x^n ].

Let e denote the l-vector [1 1 ... 1]' and X̂ the l × (n+1) matrix obtained by augmenting the vector e to X, i.e. X̂ = [X e]. Let ŵ = [w' b]' and y = [y_1 y_2 ... y_l]'. Then we can say that we are trying to determine ŵ in such a way that X̂ŵ ≈ y, in the sense that

  L(ŵ) = ||X̂ŵ - y||²

is as small as possible (recall ||z|| = ⟨z, z⟩^(1/2)). Assuming that l > n + 1, this is called an overdetermined linear system, and the solution satisfying the above criterion is called a least squares solution. Geometrically this can be interpreted in two ways:

In an (n+1)-dimensional data space we determine an n-dimensional plane z = ⟨w, x⟩ + b so that the sum of squares of the vertical distances of the points (x_i, y_i), i = 1, 2, ..., l, from the plane is minimized.

In an l-dimensional attribute space we determine a linear combination of the attribute vectors and e,

  u = w_1 x^1 + ... + w_n x^n + b e,

which is as close as possible to the output vector y in the sense that ||u - y|| is minimized.

From the second point of view we have that u is chosen in such a way that it is the projection of y onto the plane spanned by x^1, ..., x^n and e. If we do that, u - y will be orthogonal to that plane, and in particular to the vectors x^1, ..., x^n and e. This implies that

  X̂'( X̂ŵ - y ) = 0

which is equivalent to the so-called normal equations

  X̂'X̂ŵ = X̂'y

that we can solve to determine ŵ, provided all the columns of X̂ are linearly independent so that the (n+1) × (n+1) matrix X̂'X̂ is nonsingular. There are numerically more stable methods to determine ŵ from X̂ and y, based on the so-called QR factorization of X̂, that are e.g. used in MATLAB when an overdetermined system is solved directly. The normal equations can also be derived from the condition that

  ∇L(ŵ) = ∂L/∂ŵ = 2 X̂'( X̂ŵ - y ) = 0

for the ŵ that minimizes L.

A linear regression example with 10 data points and 2 attributes

>> X=[3 7;4 6;5 6;7 7;8 5;4 5;5 5;6 3;7 4;9 4]   % the ell x n (10x2) data matrix
X =
     3     7
     4     6
     5     6
     7     7
     8     5
     4     5
     5     5
     6     3
     7     4
     9     4
>> plot(X(1:5,1),X(1:5,2),'o',X(6:10,1),X(6:10,2),'*')
>> xlabel('x1'), ylabel('x2'), title('input data'), axis([2 10 2 8])

[Figure: scatter plot of the input data in the (x1, x2) plane, the first five points marked 'o' and the last five marked '*'.]

>> y=[6 5 5 7 5 4 4 2 3 4]'   % the 10-vector of output values
y = 6 5 5 7 5 4 4 2 3 4
>> ell=size(X,1)   % the number of rows in the matrix
ell = 10
>> Xhat=[X ones(ell,1)]   % the data matrix augmented with a column of 1-elements
                          % ones(m,n) is a matrix of 1-elements with m rows and n columns
Xhat =
     3     7     1
     4     6     1
     5     6     1
     7     7     1
     8     5     1
     4     5     1
     5     5     1
     6     3     1
     7     4     1
     9     4     1
>> wb=Xhat\y   % an overdetermined system can be solved "directly"
wb = 0.2603 1.2025 -3.2630

>> wb=(Xhat'*Xhat)\(Xhat'*y)   % or from the normal equations
wb = 0.2603 1.2025 -3.2630
>> error=y-Xhat*wb   % the difference between the given output values and the
                     % "calculated" values that we are trying to minimize
error = 0.0644 0.0066 -0.2538 0.0231 0.1678 0.2091 -0.0512 0.0935 -0.3694 0.1100

Relation between regression and covariances

Let m_x denote the vector of average values of the x-values in each column of X, i.e. the average attribute value associated with each column. Calculate from the data matrix X the matrix X̄ by subtracting from each column this average value. Similarly, let m_y denote the average output value and calculate from the output vector y the vector ȳ by subtracting from each output value this average output value. Then it is equivalent to solve

  X̄w ≈ ȳ,  i.e.  w = (X̄'X̄)⁻¹ X̄'ȳ,

and then set b = m_y - ⟨m_x, w⟩, as to solve

  X̂ŵ ≈ y,  i.e.  ŵ = (X̂'X̂)⁻¹ X̂'y.

This follows from the fact that if we try to add a bias to decrease the least squares error in the former case, this bias will always be zero, and that if y - m_y = ⟨w, x - m_x⟩ then y = ⟨w, x⟩ + m_y - ⟨w, m_x⟩.

The matrix (1/l) X̄'X̄ is the covariance matrix of the attributes over the dataset, from which we can calculate correlation coefficients between attributes. These coefficients may be thought of as the cosines of the angles between the corresponding (centred) attribute vectors in l-dimensional space, reflecting how close they are to each other, where the cosine of the angle between two vectors x and y is defined as

  ⟨x, y⟩ / ( ||x|| ||y|| ).

Similarly, (1/l) X̄'ȳ is a covariance vector between the output and the input attributes.

The example continued...

>> mx=mean(X)   % calculate the vector of mean values of the values in each column of X
mx = 5.8000 5.2000
>> n=size(X,2)
n = 2
>> Xbar=X-ones(ell,n)*diag(mx)   % diag(x) is a diagonal matrix with the elements
                                 % of the vector x along the diagonal
Xbar =
   -2.8000    1.8000
   -1.8000    0.8000
   -0.8000    0.8000
    1.2000    1.8000
    2.2000   -0.2000
   -1.8000   -0.2000
   -0.8000   -0.2000
    0.2000   -2.2000
    1.2000   -1.2000
    3.2000   -1.2000
>> my=mean(y)
my = 4.5000
>> ybar=y-my*ones(ell,1)
ybar = 1.5000 0.5000 0.5000 2.5000 0.5000 -0.5000 -0.5000 -2.5000 -1.5000 -0.5000

>> w=Xbar\ybar
w = 0.2603 1.2025
>> b = my-mx*w
b = -3.2630
>> covx=(1/ell)*Xbar'*Xbar
covx =
    3.3600   -1.0600
   -1.0600    1.5600
>> cor12=covx(1,2)/sqrt(covx(1,1)*covx(2,2))
cor12 = -0.4630
>> covxy=(1/ell)*Xbar'*ybar
covxy = -0.4000 1.6000
>> vary=(1/ell)*ybar'*ybar
vary = 1.8500
>> cor1y=covxy(1)/sqrt(covx(1,1)*vary)
cor1y = -0.1604
>> cor2y=covxy(2)/sqrt(covx(2,2)*vary)
cor2y = 0.9418

The steepest descent algorithm of Widrow-Hoff

We are trying to find the ŵ-vector that minimizes the function

  L(ŵ) = ½ ( y - X̂ŵ )' ( y - X̂ŵ ).

The negative gradient vector

  -∇L(ŵ) = -∂L/∂ŵ = X̂'( y - X̂ŵ )

points in the direction of steepest descent. The steepest descent algorithm is an iterative algorithm based on the idea of updating ŵ by moving a suitable distance in the direction of steepest descent at each iteration, i.e.

  ŵ_new = ŵ_old + η X̂'( y - X̂ŵ_old ).

Since

  X̂'( y - X̂ŵ ) = ∑_{i=1}^l ( y_i - ⟨x̂_i, ŵ⟩ ) x̂_i,

the inner loop of the algorithm in Table 2.3 on p. 23 corresponds to one step of this algorithm. They are, however, not fully identical, because in Table 2.3 we update ŵ for each i in the inner loop, so

when calculating y_i - ⟨x̂_i, ŵ⟩, ŵ will contain both new and old values.

We apply the steepest descent algorithm below to

  L(w) = ½ ( ȳ - X̄w )' ( ȳ - X̄w )

with the example above. First we show the contour lines of L:

>> s=-5:0.1:5;
>> t=s;
>> [w1,w2]=meshgrid(s,t);
>> z=0;
>> for i=1:ell, z=z+0.5*(Xbar(i,1).*w1+Xbar(i,2).*w2-ybar(i)).^2; end
>> contour(s,t,z,40)
>> xlabel('w_1'), ylabel('w_2'), title('weight versus loss function (L)')

[Figure: contour plot of the loss function L over the (w_1, w_2) plane.]

>> eta=1;   % Try first $\eta=1$. Note that the direction of steepest descent
            % is always orthogonal to the contour lines
>> w=[3 3]'
w = 3 3
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w = -70 4
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w = 1.0e+003 * 2.3204 -0.7844
>> eta=0.1;   % Try again with $\eta=0.1$.
>> w=[3 3]'
w = 3 3
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w = -4.3000 3.1000
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w = 13.0340 -4.6940
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w = -36.1359 18.0447
>> eta=0.01;   % Try again with $\eta=0.01$.

>> w=[3 3]'
w = 3 3
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w = 2.2700 3.0100
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w = 1.7863 2.9411
>> w = w - eta*Xbar'*(Xbar*w-ybar)   % after 100 more iterations we finally get
                                     % the right answer to 5 significant digits
w = 0.2603 1.2026
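
The interactive steps above can be collected into a short loop; a minimal sketch (the iteration limit and stopping tolerance are arbitrary choices, not from the text), using the Xbar, ybar, mx and my computed earlier:

  eta = 0.01;  w = [3; 3];                 % same step size and starting point as above
  for iter = 1:1000
      grad = Xbar'*(Xbar*w - ybar);        % gradient of L at the current w
      w = w - eta*grad;                    % steepest descent step
      if norm(grad) < 1e-8, break, end     % stop when the gradient is (nearly) zero
  end
  w                                        % approaches the least squares solution 0.2603, 1.2025
  b = my - mx*w                            % recover the bias as before, -3.2630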

Ridge regression

Here we aim to strike a balance between fitting the input data to the output and keeping the absolute values of w and b small, by determining w and b such that

  [ X̂ ; μI ] ŵ ≈ [ y ; 0 ]

where I is the (n+1) × (n+1) identity matrix and 0 is an (n+1)-vector of zeros. This is equivalent to setting

  ŵ = ( X̂'X̂ + λI )⁻¹ X̂'y

where λ = μ².

>> mu=1
mu = 1
>> Xridge=[Xhat;mu*eye(n+1)]
Xridge =
     3     7     1
     4     6     1
     5     6     1
     7     7     1
     8     5     1
     4     5     1
     5     5     1
     6     3     1
     7     4     1
     9     4     1
     1     0     0
     0     1     0
     0     0     1
>> yridge=[y;zeros(n+1,1)]
yridge = 6 5 5 7 5 4 4 2 3 4 0 0 0
>> wb1=Xridge\yridge
wb1 = 0.0773 0.8735 -0.4459

>> error=yridge-Xridge*wb1
error = 0.0994 -0.1043 -0.1816 0.7904 0.4601 -0.2308 -0.3081 -0.6383 -0.5891 0.2564 -0.0773 -0.8735 0.4459
>> mu=10   % Put greater weight on making $\hat{\vec{w}}$ small
mu = 10
>> Xridge=[Xhat;mu*eye(n+1)]
Xridge =
     3     7     1
     4     6     1
     5     6     1
     7     7     1
     8     5     1
     4     5     1
     5     5     1
     6     3     1
     7     4     1
     9     4     1
    10     0     0
     0    10     0
     0     0    10

>> wb10 = Xridge\yridge
wb10 = 0.2689 0.4368 0.0608
>> lambda=mu^2   % Here we calculate $\hat{\vec{w}}$ using the equivalent
                 % alternative formulation
lambda = 100
>> wb10 = (Xhat'*Xhat + lambda*eye(n+1))\(Xhat'*y)
wb10 = 0.2689 0.4368 0.0608

Dual ridge regression

Here we try the dual approach to ridge regression described on pages 23 and 24 of the textbook. The central idea is that we seek a formulation based on the l × l Gram matrix Ĝ = X̂X̂' rather than the (n+1) × (n+1) covariance matrix X̂'X̂, but note that the Gram matrix will be singular if l > n + 1. Also note that the derivation in the book holds if we work with X̂ rather than X, which we have to do in order to determine b as well as w. In particular, if

  ( λI_l + Ĝ ) α̂ = y,   where ŵ = X̂'α̂,

then

  Ĝ ( λI_l + Ĝ ) α̂ = Ĝ y

and hence

  X̂ ( λ X̂'α̂ + X̂'X̂ X̂'α̂ ) = X̂ X̂' y,

which in turn implies that

  ( λI_{n+1} + X̂'X̂ ) ŵ = X̂'y,

i.e. the equation for primal ridge regression, provided the columns of X̂ are linearly independent.

>> lambda = 100;   % i.e. 10^2 as above
>> G=Xhat*Xhat'
G =
    59    55    58    71    60    48    51    40    50    56
    55    53    57    71    63    47    51    43    53    61
    58    57    62    78    71    51    56    49    60    70
    71    71    78    99    92    64    71    64    78    92
    60    63    71    92    90    58    66    64    77    93
    48    47    51    64    58    42    46    40    49    57
    51    51    56    71    66    46    51    46    56    66
    40    43    49    64    64    40    46    46    55    67
    50    53    60    78    77    49    56    55    66    80
    56    61    70    92    93    57    66    67    80    98
>> alpha=(lambda*eye(ell)+G)\y
alpha = 0.0208 0.0124 0.0097 0.0200 0.0060 0.0068 0.0041 -0.0098 -0.0069 -0.0023
>> wb10=Xhat'*alpha   % Note that these are the same coefficients as we get
                      % with the ordinary ridge regression above.
wb10 = 0.2689 0.4368 0.0608
>> lambda=0.0001   % By decreasing $\lambda$ we get closer to the original regression values
lambda = 1.0000e-004

>> alpha=(lambda*eye(ell)+G)\y
alpha = 1.0e+003 * 0.6434 0.0647 -2.5373 0.2354 1.6799 2.0880 -0.5141 0.9305 -3.6948 1.1012
>> wb001=Xhat'*alpha
wb001 = 0.2602 1.2023 -3.2612
>> alpha=G\y   % The results become, however, meaningless when we set $\lambda=0$
               % because $G$ is singular
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate.
alpha = 1.0e+017 * 0.6481 4.8803 6.1694 -0.0003 -0.3092 -9.1020 -5.5287 9.1030 -6.1711 0.3105
>> wb=Xhat'*alpha
wb = -960 -144 -36

Regression, classification and discriminant analysis

We can use linear regression to obtain a separator (w, b) for the dataset (x_1, y_1), ..., (x_l, y_l), where y_i ∈ {-1, 1}, simply by solving the system X̂ŵ ≈ y. Note that in the normal equations

  X̂'X̂ŵ = X̂'y

the right hand side is now the vector (of length n + 1)

  [ (l_1 m_{x,1} - l_{-1} m_{x,-1})   (l_1 - l_{-1}) ]'

where l_1 and l_{-1} are the number of data points in groups 1 and -1 respectively, and m_{x,1} and m_{x,-1} are the (row) vectors of average attribute values within these groups.¹ If l_1 = l_{-1}, we are effectively choosing w in such a way that we maximize the difference between the mean value of ⟨w, x⟩ within group 1 on one hand and group -1 on the other, while restricting the variance in these values to be constant.

¹ For example: say there are 5 examples in each class, located in the top and bottom halves of the data matrix; then the r.h.s. vector X̂'y would be computed in MATLAB (as a row vector) like this: 5*mean(Xhat(1:5,:)) - 5*mean(Xhat(6:10,:))

This is the separation criterion used in so-called discriminant analysis in statistics (cf. p. 19 in the book). The separator obtained in this way is, however, not necessarily a maximum margin separator.

>> Ip = 1:5   % indices for group +1
Ip = 1 2 3 4 5
>> Im = 6:10   % indices for group -1
Im = 6 7 8 9 10
>> y(Ip) = 1
y = 1 1 1 1 1
>> y(Im) = -1
y = 1 1 1 1 1 -1 -1 -1 -1 -1
>> wb = Xhat\y
wb = 0.1059 0.7130 -4.3215
>> Xhat*wb
ans = 0.9869 0.3798 0.4857 1.4104 0.0903 -0.3332 -0.2273 -1.5474 -0.7285 -0.5168
>> % clearly not a max-min margin separator! Why?

1-norm regression

When solving an overdetermined system X̂ŵ ≈ y so that ||X̂ŵ - y|| is minimized, we do not necessarily want to define the length of a vector as ||z||_2 = (∑_i z_i²)^(1/2), called the two-norm. An alternative is ||z||_1 = ∑_i |z_i|, called the one-norm. With such a definition we wish to determine ŵ in such a way that

  ∑_{i=1}^l | ⟨w, x_i⟩ + b - y_i |

is minimized. In order to obtain the 1-norm solution we can formulate the problem as the following linear programming problem:

  min ∑_{i=1}^l ( ξ⁺_i + ξ⁻_i )
  subject to  X̂ŵ + ξ⁺ - ξ⁻ = y,  ξ⁺ ≥ 0,  ξ⁻ ≥ 0.

Note that

  ξ⁺_i = y_i - ⟨ŵ, x̂_i⟩     if this error is > 0, and 0 otherwise,
  ξ⁻_i = -( y_i - ⟨ŵ, x̂_i⟩ )  if this error is < 0, and 0 otherwise.

This problem is readily solved by the MATLAB function

  x = linprog(f,A,b,Aeq,beq,low)

which solves the problem

  min f'x   s.t.   Ax ≤ b,  Aeq·x = beq,  x ≥ low

(note that the number of elements in f, x and low and the number of columns in A and Aeq must all be the same). We introduce the following (n+1+2l)-vectors:

  x = [ w'  b  ξ⁺'  ξ⁻' ]'
  f = [ zeros(1, n+1)  ones(1, 2l) ]
  low = [ -inf*ones(1, n+1)  zeros(1, 2l) ]

and set

  A = zeros(1, n+1+2l),  b = 0,  Aeq = [ X̂  I_l  -I_l ],  beq = y.

>> y=[6 5 5 7 5 4 4 2 3 4]';
>> f=[zeros(1,n+1) ones(1,2*ell)]
f =
  Columns 1 through 15
     0     0     0     1     1     1     1     1     1     1     1     1     1     1     1
  Columns 16 through 23
     1     1     1     1     1     1     1     1
>> low=[-inf -inf -inf zeros(1,2*ell)]
low =
  Columns 1 through 15
  -Inf  -Inf  -Inf     0     0     0     0     0     0     0     0     0     0     0     0
  Columns 16 through 23
     0     0     0     0     0     0     0     0
>> Aeq=[Xhat eye(ell) -eye(ell)]
Aeq =
  Columns 1 through 15
     3     7     1     1     0     0     0     0     0     0     0     0     0    -1     0
     4     6     1     0     1     0     0     0     0     0     0     0     0     0    -1
     5     6     1     0     0     1     0     0     0     0     0     0     0     0     0
     7     7     1     0     0     0     1     0     0     0     0     0     0     0     0
     8     5     1     0     0     0     0     1     0     0     0     0     0     0     0
     4     5     1     0     0     0     0     0     1     0     0     0     0     0     0
     5     5     1     0     0     0     0     0     0     1     0     0     0     0     0
     6     3     1     0     0     0     0     0     0     0     1     0     0     0     0
     7     4     1     0     0     0     0     0     0     0     0     1     0     0     0
     9     4     1     0     0     0     0     0     0     0     0     0     1     0     0
  Columns 16 through 23
     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0
    -1     0     0     0     0     0     0     0
     0    -1     0     0     0     0     0     0
     0     0    -1     0     0     0     0     0
     0     0     0    -1     0     0     0     0
     0     0     0     0    -1     0     0     0
     0     0     0     0     0    -1     0     0
     0     0     0     0     0     0    -1     0
     0     0     0     0     0     0     0    -1

>> sol=linprog(f,zeros(1,n+1+2*ell),0,Aeq,y,low)
Optimization terminated successfully.
sol = 0.2727 1.1818 -3.1818 0.0909 0.0000 0.0000 0.0000 0.0909 0.1818 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.2727 0.0000 0.0000 0.0000 0.0909 0.0000 0.4545 0.0000
% The first two elements of the solution are the w-values
% The third element is the b-value
% The next 10 values are the positive differences between output values and calculated values
% The last 10 values are the negative differences between output values and calculated values

>> error = y-Xhat*sol(1:3)
error = 0.0909 0.0000 -0.2727 0.0000 0.0909 0.1818 -0.0909 0.0000 -0.4545 0.0000
% note how the error relates to the solution

Other possibilities

Another possibility would be to determine ŵ in such a way that

  max_{i ∈ {1,...,l}} | ⟨w, x_i⟩ + b - y_i |

is minimized; this will be referred to as regression in the ∞-norm. In support vector regression we encounter yet another possibility for determining ŵ.
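
Like the 1-norm case, the ∞-norm problem can be written as a linear program, here by minimizing a single variable t that bounds all the residuals: minimize t subject to -t ≤ ⟨w, x_i⟩ + b - y_i ≤ t. A hedged MATLAB sketch in the style of the 1-norm example (this particular formulation is not worked out in the text; Xhat, y, n and ell are as in that example):

  % unknowns z = [w; b; t], where t is the largest absolute error
  f   = [zeros(n+1,1); 1];                 % minimize t
  A   = [ Xhat, -ones(ell,1);              %  Xhat*wb - t <=  y
         -Xhat, -ones(ell,1)];             % -Xhat*wb - t <= -y
  bA  = [y; -y];
  low = [-inf*ones(n+1,1); 0];             % t must be nonnegative
  z   = linprog(f, A, bA, [], [], low);
  wb_inf = z(1:n+1)                        % the infinity-norm regression coefficients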