Linear Learning Machines

1 Linear Learning Machines
Chapter 2, February 13, 2003
T.P. Runarsson (tpr@hi.is) and S. Sigurdsson (sven@hi.is)

2 Linear learning machines

In supervised learning, the learning machine is given a training set of examples (inputs) with associated labels (output values),

    S = ((x_1, y_1), (x_2, y_2), ..., (x_l, y_l)) ⊆ (X × Y)^l

(l denotes the number of training samples, the x_i are examples or instances, and the y_i their labels). A training set S is said to be trivial if all labels are equal.

Usually the input space is a subset of real vector space, X ⊆ R^n (n is the dimension of the input space). An input x = (x_1, x_2, ..., x_n)' is a vector of length n (' denotes matrix transposition).

Linear functions are probably the best understood and simplest hypotheses. A learning machine using a hypothesis that forms linear combinations of the input variables is known as a linear learning machine.

3 Linear classification

A linear function f(x) is frequently used for binary classification, y ∈ {−1, +1}, as follows: assign x = (x_1, x_2, ..., x_n)' to the +ve class if f(x) ≥ 0, otherwise assign it to the −ve class, where

    f(x) = Σ_{i=1}^n w_i x_i + b = ⟨w · x⟩ + b

(⟨ · ⟩ denotes the inner product).

[Figure: a separating hyperplane (w, b) ∈ R^n × R for a 2D training set, with its normal vector w and bias b.]

4 Linear classification - geometric interpretation

[Figure: the hyperplane (dark line), its normal vector w, the bias b and a point x_i.]

The vector w defines a direction perpendicular to the hyperplane (the dark line). The value of b moves the hyperplane parallel to itself; this value is sometimes called the bias (or threshold) and is necessary if all hyperplanes in R^n are to be represented (the number of free parameters is then n + 1).

Recall that the perpendicular Euclidean distance from a point x_i to the hyperplane is

    (Σ_{j=1}^n w_j x_{ij} + b) / ||w|| = (⟨w · x_i⟩ + b) / ||w||,

where ||w|| = sqrt(⟨w · w⟩).
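The decision rule and the distance formula are easy to check numerically. A minimal MATLAB sketch (with made-up data, weights and bias, purely for illustration):

    X = [0 0; 1 0; 0 1; 1 1];     % one example per row
    w = [1; 1];                    % hypothetical weight vector
    b = -0.5;                      % hypothetical bias
    f = X*w + b;                   % f(x_i) = <w . x_i> + b for every example
    yhat = sign(f);                % predicted classes
    yhat(f == 0) = 1;              % f(x) = 0 is assigned to the +ve class by convention
    dist = abs(f) / norm(w);       % perpendicular Euclidean distance to the hyperplane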

5 Rosenblatt's Perceptron

Both statisticians and neural network researchers have used linear classifiers: the theory of linear discriminants by Fisher in 1936, and then the perceptron, introduced by Rosenblatt in the late 1950s. Rosenblatt's algorithm was the first iterative procedure for learning linear classifications. The algorithm is:
- on-line: it starts with an initial connection weight vector w = 0 (the all-zero vector),
- mistake driven: it only adapts the weights when a classification mistake is made.

The algorithm converges when all the training data are correctly classified; this requires the data to be linearly separable, and convergence occurs in finite time.

6 The neuron and perceptron analogy

[Figure: a biological neuron (dendrites, nucleus, cell body, axon, synapses to and from other neurons) alongside the perceptron: input nodes x_1, ..., x_n with connection weights w_1, ..., w_n feeding an output node that computes f(⟨w · x⟩ + b).]

(The threshold of the output node would be set to 0; an extra input with a constant value of 1 is added and the corresponding weight is the bias b.)

7 Linear separability

In the 1960s Rosenblatt's work (plus some hype) spurred a huge growth of research and corresponding financial investment in this area. In 1969, however, Minsky and Papert published a book titled Perceptrons (they were working in symbol-processing AI, a competitor to the perceptron approach). The book presented a condemning look at perceptrons, and as a result funding was blocked for more than 10 years! The field was revived by Hopfield (1982) and Rumelhart & McClelland (1986).

[Figure: the AND and XOR problems; AND is linearly separable, XOR is not.]

8 The perceptron algorithm (primal form)

Given a linearly separable training set S and a learning rate η ∈ R+ (y ∈ {−1, +1}):

    w_0 ← 0, b_0 ← 0, k ← 0, R ← max_{1≤i≤l} ||x_i||
    repeat
        for i = 1 to l
            if y_i (⟨w_k · x_i⟩ + b_k) ≤ 0 then
                w_{k+1} ← w_k + η y_i x_i
                b_{k+1} ← b_k + η y_i R²
                k ← k + 1
            end if
        end for
    until no mistake is made within the for loop
    return (w_k, b_k), where k is the number of mistakes.

Note that the contribution of example i to the weight change is α_i η y_i x_i, where α_i is the number of times x_i has been misclassified (i.e. k = Σ_{i=1}^l α_i). Therefore we may write

    w = Σ_{i=1}^l α_i y_i x_i

assuming the initial weights are zero. The learning rate η only changes the scaling of the hyperplane and so is not really needed.
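The pseudocode translates almost line for line into MATLAB. A sketch, assuming linearly separable data as above (the function name perceptron_primal is a hypothetical helper, not something defined in these notes):

    function [w, b, k] = perceptron_primal(X, y, eta)
    % Primal perceptron: X is l x n (one example per row), y is l x 1 with
    % entries +1/-1, eta is the learning rate.
    [l, n] = size(X);
    w = zeros(n, 1); b = 0; k = 0;
    R = max(sqrt(sum(X.^2, 2)));         % R = max_i ||x_i||
    mistakes = true;
    while mistakes                        % repeat ... until no mistake is made
        mistakes = false;
        for i = 1:l
            if y(i)*(X(i,:)*w + b) <= 0   % classification mistake on example i
                w = w + eta*y(i)*X(i,:)';
                b = b + eta*y(i)*R^2;
                k = k + 1;
                mistakes = true;
            end
        end
    end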

9 Perceptron (primal) numerical example

The following are the first few steps of the perceptron algorithm (primal form) for the OR problem:

    sample_id   x(1)   x(2)    y
        1        0      0     -1
        2        1      0     +1
        3        0      1     +1
        4        1      1     +1

We start with w^(0) = 0, b^(0) = 0, k = 0 and examine each example in turn. For the first sample x = (x(1), x(2)) = (0, 0) and y = −1:

    f(x) = w^(0)(1) x(1) + w^(0)(2) x(2) + b^(0) = 0·0 + 0·0 + 0 = 0

but it should be −ve because y = −1 for x = (0, 0), and so we must do an update:

    w^(k+1)(1) = w^(k)(1) + η y x(1) = 0 + 1·(−1)·0 = 0

(here we have chosen η = 1). Similarly:

    w^(k+1)(2) = w^(k)(2) + η y x(2) = 0 + 1·(−1)·0 = 0

and finally we update the bias:

    b^(k+1) = b^(k) + η y R² = 0 + 1·(−1)·2 = −2

Here R is the largest norm of the input vectors: ||x_1|| = 0, ||x_2|| = 1, ||x_3|| = 1, ||x_4|| = √2, that is R = √2 (so R² = 2). Finally we update our counter, k = k + 1.

10 Now we still have w^(1) = 0 but b^(1) = −2. Let's examine the second sample, x = (x(1), x(2)) = (1, 0) and y = +1:

    f(x) = w^(1)(1) x(1) + w^(1)(2) x(2) + b^(1) = 0·1 + 0·0 − 2 = −2

but it should be +ve because y = +1 for x = (1, 0), and so we must do an update:

    w^(k+1)(1) = w^(k)(1) + η y x(1) = 0 + 1·1·1 = 1
    w^(k+1)(2) = w^(k)(2) + η y x(2) = 0 + 1·1·0 = 0

and finally the bias:

    b^(k+1) = b^(k) + η y R² = −2 + 1·1·2 = 0

and we update our counter k = k + 1 again. Continue looping through the examples until no more classification mistakes are made; a complete run is sketched below.
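With the hypothetical perceptron_primal function sketched above, the complete run on the OR data is a one-liner (again only an illustration of the loop just worked through by hand):

    X = [0 0; 1 0; 0 1; 1 1];                % the OR inputs, one example per row
    y = [-1; 1; 1; 1];                        % the OR labels
    [w, b, k] = perceptron_primal(X, y, 1);   % eta = 1 as in the worked example
    % (w, b) is a separating hyperplane for the OR problem and k is the
    % total number of updates (mistakes) that were made.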

11 Multi-class perceptron

When we have m different classes, simply create m perceptrons! For example, say there are m = 3 classes and two inputs (n = 2), and we are given a training set S of the four points [0, 0], [1, 0], [0, 1], [1, 1], each labelled with one of the three classes.

The target outputs for the three different perceptrons are then: the first perceptron (w_1, b_1) is used to distinguish all inputs belonging to the first class, so its target output is +1 for the examples of that class and −1 for all the others; likewise the second (w_2, b_2) for the second class and the last (w_3, b_3) for the third class. I.e. either the input belongs to a particular class (+1) or not (−1).

The perceptron algorithm is the same as before, but now we simply have three problems to solve!

12 The results for the above example could look something like this:

[Figure: the three perceptron decision boundaries in the (x_1, x_2) plane; the regions between them are hatched.]

In the outer hatched regions we have points belonging to two classes! In the inner hatched region the points are in no class. In general we may resolve the hatched regions as follows:

    c(x) = arg max_{1≤i≤m} ( ⟨w_i · x⟩ + b_i )

i.e. assigning x to the class whose hyperplane is furthest from it.
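Once the m weight vectors are stacked as the columns of a matrix W and the biases collected in a vector b, the arg-max rule is a one-liner in MATLAB; a small sketch with made-up weights:

    W = [ 1 -1  0 ;                % hypothetical weight vectors, one per class (column)
          0  1 -1 ];
    b = [-0.5 -0.5 0.5];           % hypothetical biases, one per class
    x = [1; 0];                    % a test input
    scores = x'*W + b;             % <w_i . x> + b_i for i = 1..m
    [~, c] = max(scores);          % c(x) = arg max_i ( <w_i . x> + b_i )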

13 The perceptron algorithm (dual form)

Given a linearly separable training set S (y ∈ {−1, +1}):

    R ← max_{1≤i≤l} ||x_i||, α ← 0, b ← 0
    repeat
        for i = 1 to l
            if y_i ( Σ_{j=1}^l α_j y_j ⟨x_j · x_i⟩ + b ) ≤ 0 then
                α_i ← α_i + 1
                b ← b + η y_i R²
            end if
        end for
    until no mistake is made within the for loop
    return (α, b) to define h(x):

    h(x) = sgn(⟨w · x⟩ + b) = sgn( Σ_{j=1}^l α_j y_j ⟨x_j · x⟩ + b )

The parameter α_i is referred to as the embedding strength:
- an example i with few/many mistakes has a small/large α_i,
- for non-separable data the α_i of misclassified points keep on growing,
- α_i can be regarded as the information content of x_i.
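A MATLAB sketch of the dual form (the function name perceptron_dual is a hypothetical helper; as before, termination assumes linearly separable data) makes it explicit that the examples enter only through the Gram matrix G:

    function [alpha, b] = perceptron_dual(X, y, eta)
    % Dual perceptron: X is l x n (one example per row), y is l x 1 (+1/-1).
    l = size(X, 1);
    G = X*X';                            % Gram matrix, G(i,j) = <x_i . x_j>
    R = max(sqrt(diag(G)));              % R = max_i ||x_i||
    alpha = zeros(l, 1); b = 0;
    mistakes = true;
    while mistakes
        mistakes = false;
        for i = 1:l
            if y(i)*(sum(alpha.*y.*G(:,i)) + b) <= 0
                alpha(i) = alpha(i) + 1;
                b = b + eta*y(i)*R^2;
                mistakes = true;
            end
        end
    end
    % A new point x is then classified as h(x) = sign(sum(alpha.*y.*(X*x)) + b).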

14 The Gram matrix

Given a set {x_1, ..., x_l} of vectors from an inner product space X, the l × l matrix G with entries G_ij = ⟨x_i · x_j⟩ is called the Gram matrix. The book sometimes uses the following notation:

    G = ( ⟨x_i · x_j⟩ )_{i,j=1}^l

An important observation here is that the input data enter the algorithm only through the entries of the Gram matrix!

15 Margins of a hyperplane

The (functional) margin of an example (x_i, y_i) with respect to a hyperplane (w, b) is the quantity (y ∈ {−1, +1})

    γ_i = y_i (⟨w · x_i⟩ + b).

γ_i > 0 implies a correct classification of (x_i, y_i).

The margin distribution of a hyperplane (w, b) w.r.t. a training set S is the distribution of the margins of the examples in S. The minimum of the margin distribution is the (functional) margin of the hyperplane (w, b) with respect to the training set S.

The geometric margin is the perpendicular Euclidean distance of the point to the hyperplane (see page 3), i.e. γ_i / ||w||.

The margin of a training set S is the maximum geometric margin over all hyperplanes. A hyperplane realizing this maximum is known as a maximal margin hyperplane.
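For whatever hyperplane (w, b) is at hand (e.g. the output of the perceptron above), these quantities are computed directly; a short sketch assuming X, y, w and b are already in the workspace:

    gamma_f  = y .* (X*w + b);      % functional margins of all examples
    margin_f = min(gamma_f);        % functional margin of (w, b) w.r.t. S
    gamma_g  = gamma_f / norm(w);   % geometric margins
    margin_g = min(gamma_g);        % geometric margin of (w, b) w.r.t. S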

16 The margin and maximal margin

The main points:
- The margin of S is the smallest γ_i / ||w|| over the examples in S w.r.t. the hyperplane (w, b).
- Now try to find some other hyperplane (w_opt, b_opt) for which this margin is largest. This will be the maximal geometric margin, and the corresponding hyperplane the maximal margin hyperplane.

[Figure: a 2D training set with its maximal margin hyperplane and margin γ.]

17 Convergence of the perceptron in finite time

Theorem [Novikoff]. Let S be a non-trivial training set, and let

    R = max_{1≤i≤l} ||x_i||.

Suppose that there exists a vector w_opt such that ||w_opt|| = 1 and

    y_i (⟨w_opt · x_i⟩ + b_opt) ≥ γ    for 1 ≤ i ≤ l.

Then the number of mistakes made by the on-line perceptron algorithm on S is at most

    (2R / γ)².

Note that for the dual form we can now bound α by the number of mistakes made, i.e.

    ||α||_1 ≤ (2R / γ)²    ( ||α||_1 = Σ_{i=1}^l α_i ).
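The bound is easy to evaluate for any candidate hyperplane; a sketch assuming X, y and some separating (w_opt, b_opt) are already in the workspace:

    s = norm(w_opt);
    w_opt = w_opt/s;  b_opt = b_opt/s;       % rescale so that ||w_opt|| = 1
    R = max(sqrt(sum(X.^2, 2)));              % R = max_i ||x_i||
    gamma = min(y .* (X*w_opt + b_opt));      % margin of the rescaled hyperplane (must be > 0)
    bound = (2*R/gamma)^2;                    % Novikoff bound on the number of mistakes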

18 The margin slack variable

The margin slack variable measures the amount of non-separability of the sample.

[Figure: a hyperplane with target margin γ; two points x_i and x_j fail to meet the margin, with slacks ξ_i and ξ_j.]

19 Convergence and non-separable data

Formally, the new quantity, the margin slack variable of an example (x_i, y_i) with respect to the hyperplane (w, b) and target margin γ, is defined as

    ξ_i = max(0, γ − y_i (⟨w · x_i⟩ + b)).

If ξ_i > γ, then x_i is misclassified by (w, b). The norm D = ||ξ|| takes into account any misclassification of the training data.

The number of mistakes in the first execution of the for loop of the perceptron algorithm on S is bounded by

    ( 2(R + D) / γ )²

(see the Freund and Schapire theorem).
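Again this is straightforward to evaluate numerically; a sketch assuming X, y, a hyperplane (w, b) and a target margin gamma are in the workspace:

    xi = max(0, gamma - y.*(X*w + b));   % margin slack variables
    D  = norm(xi);                        % D = ||xi||
    R  = max(sqrt(sum(X.^2, 2)));
    bound = (2*(R + D)/gamma)^2;          % mistake bound for the first pass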

20 Linear regression

In linear regression we associate each data point x ∈ R^n with a distinct output value y ∈ R and aim to construct a linear function f : R^n → R,

    f(x) = ⟨w · x⟩ + b,

where w ∈ R^n and b ∈ R, such that f(x) ≈ y for each data vector. In particular, in the training set S = ((x_1, y_1), ..., (x_l, y_l)) all the y-values may be different, and we determine from this set w and b in such a way that

    L(w, b) = Σ_{i=1}^l ( ⟨w · x_i⟩ + b − y_i )²

is as small as possible.

Introduce an l × n matrix X with l rows and n columns so that the i-th row of X is the i-th data vector x_i'. Denote the j-th column of X by x^j; this is the j-th attribute vector, containing the j-th attribute value of all the l data vectors. Thus

    X = [ x_1'; x_2'; ... ; x_l' ] = [ x^1  x^2  ...  x^n ].

21 Let e denote the l-vector [1 1 ... 1]' and X̂ the l × (n + 1) matrix obtained by augmenting X with the vector e, i.e. X̂ = [X e]. Let ŵ = [w' b]' and y = [y_1 y_2 ... y_l]'. Then we can say that we are trying to determine ŵ in such a way that X̂ŵ ≈ y, in the sense that

    L(ŵ) = ||X̂ŵ − y||²

is as small as possible (recall ||z|| = ⟨z · z⟩^{1/2}).

Assuming that l > n + 1, this is called an overdetermined linear system, and the solution satisfying the above criterion is called a least squares solution. Geometrically this can be interpreted in two ways:
- In the (n + 1)-dimensional data space we determine an n-dimensional plane z = ⟨w · x⟩ + b so that the sum of squares of the vertical distances of the points (x_i, y_i), i = 1, 2, ..., l, from the plane is minimized.
- In the l-dimensional attribute space we determine a linear combination of the attribute vectors and e, u = w_1 x^1 + ... + w_n x^n + b e, which is as close as possible to the output vector y in the sense that ||u − y|| is minimized.

22 From the second point of view, u is chosen in such a way that it is the projection of y onto the plane spanned by x^1, ..., x^n and e. If we do that, u − y will be orthogonal to that plane, and in particular to the vectors x^1, ..., x^n and e. This implies that

    X̂' (X̂ŵ − y) = 0,

which is equivalent to the so-called normal equations:

    X̂'X̂ ŵ = X̂' y

that we can solve to determine ŵ, provided all the columns of X̂ are linearly independent so that the (n + 1) × (n + 1) matrix X̂'X̂ is nonsingular.

There are numerically more stable methods to determine ŵ from X̂ and y, based on so-called QR-factorization of X̂, that are e.g. used in MATLAB when an overdetermined system is solved directly.

The normal equations can also be derived from the condition that

    ∇L(ŵ) = ∂L/∂ŵ = 2 X̂' (X̂ŵ − y) = 0

for the ŵ that minimizes L.

23 A linear regression example with 10 data points and 2 attributes

>> X=[3 7;4 6;5 6;7 7;8 5;4 5;5 5;6 3;7 4;9 4]   % the ell x n (10x2) data matrix
>> plot(X(1:5,1),X(1:5,2),'o',X(6:10,1),X(6:10,2),'*')
>> xlabel('x1'),ylabel('x2'),title('input data'),axis([ ])

[Figure: scatter plot of the ten input points in the (x1, x2) plane, titled 'input data'.]

24
>> y=[6 5 5 7 5 4 4 2 3 4]'          % The 10-output vector
>> ell=size(X,1)                      % The number of rows in the matrix
ell =
    10
>> Xhat=[X ones(ell,1)]               % The data matrix augmented with a column of
                                      % 1-elements; ones(m,n) is a matrix of one-elements
                                      % with m rows and n columns
>> wb=Xhat\y                          % An overdetermined system can be solved "directly"
wb =
    0.2603
    1.2025
   -3.2630

25
>> wb=(Xhat'*Xhat)\(Xhat'*y)          % or from the normal equations
wb =
    0.2603
    1.2025
   -3.2630
>> error=y-Xhat*wb                    % The difference between given output values and
                                      % "calculated" values that we are trying to minimize
error =
    0.0644
    0.0066
   -0.2538
    0.0231
    0.1678
    0.2091
   -0.0512
    0.0935
   -0.3694
    0.1100

26 Relation between regression and covariances

Let m_x denote the vector of average values of the x-values in each column of X, i.e. the average attribute value associated with each column. Calculate from the data matrix X the matrix X̄ by subtracting from each column this average value. Similarly, let m_y denote the average output value and calculate from the output vector y the vector ȳ by subtracting from each output value this average output value.

Then it is equivalent to solve X̄w ≈ ȳ, i.e.

    w = (X̄'X̄)^{-1} X̄'ȳ,

and then set b = m_y − ⟨m_x · w⟩, as to solve X̂ŵ ≈ y, i.e.

    ŵ = (X̂'X̂)^{-1} X̂'y.

This follows from the fact that if we try to add a bias to decrease the least squares error in the former (centred) case, this bias will always be zero, and that if y − m_y = ⟨w · (x − m_x)⟩ then y = ⟨w · x⟩ + (m_y − ⟨w · m_x⟩).

27 The matrix (1/l) X̄'X̄ is the covariance matrix of the attributes over the dataset, from which we can calculate correlation coefficients between attributes. These coefficients may be thought of as the cosines of the angles between the corresponding (centred) attribute vectors in l-dimensional space, reflecting how close they are to each other, where the cosine of the angle between two vectors x and y is defined as

    ⟨x · y⟩ / ( ||x|| ||y|| ).

Similarly, (1/l) X̄'ȳ is a covariance vector between the output and the input attributes.

28 The example continued...

>> mx=mean(X)                         % calculate the vector of mean values of the
                                      % values in each column of X
mx =
    5.8000    5.2000
>> n=size(X,2)
n =
     2
>> Xbar=X-ones(ell,n)*diag(mx)        % diag(x) is a diagonal matrix with the elements
                                      % of the vector x along the diagonal
>> my=mean(y)
my =
    4.5000
>> ybar=y-my*ones(ell,1)
ybar =
    1.5000
    0.5000
    0.5000
    2.5000
    0.5000
   -0.5000
   -0.5000
   -2.5000
   -1.5000
   -0.5000

29
>> w=Xbar\ybar
w =
    0.2603
    1.2025
>> b=my-mx*w
b =
   -3.2630
>> covx=(1/ell)*Xbar'*Xbar
covx =
    3.3600   -1.0600
   -1.0600    1.5600
>> cor12=covx(1,2)/sqrt(covx(1,1)*covx(2,2))
cor12 =
   -0.4630
>> covxy=(1/ell)*Xbar'*ybar
covxy =
   -0.4000
    1.6000
>> vary=(1/ell)*ybar'*ybar
vary =
    1.8500
>> cor1y=covxy(1)/sqrt(covx(1,1)*vary)
cor1y =
   -0.1604
>> cor2y=covxy(2)/sqrt(covx(2,2)*vary)
cor2y =
    0.9418

30 The steepest descent algorithm of Widrow-Hoff

We are trying to find the ŵ-vector that minimizes the function

    L(ŵ) = ½ (y − X̂ŵ)' (y − X̂ŵ).

The negative gradient vector

    −∇L(ŵ) = −∂L/∂ŵ = X̂' (y − X̂ŵ)

points in the direction of steepest descent. The steepest descent algorithm is an iterative algorithm based on the idea of updating ŵ by moving a suitable distance in the direction of steepest descent at each iteration, i.e.

    ŵ_new = ŵ_old + η X̂' (y − X̂ŵ_old).

Since

    X̂' (y − X̂ŵ) = Σ_{i=1}^l ( y_i − ⟨x̂_i · ŵ⟩ ) x̂_i,

the inner loop of the algorithm in Table 2.3 on p. 23 corresponds to one step of this algorithm, but they are not fully identical because in Table 2.3 we update ŵ for each i in the inner loop, so

31 when calculating y_i − ⟨x̂_i · ŵ⟩, ŵ will contain both new and old values.

We apply the steepest descent algorithm below to

    L(w) = ½ (ȳ − X̄w)' (ȳ − X̄w)

with the example above. First we show the contour lines of L:

>> s=-5:0.1:5;
>> t=s;
>> [w1,w2]=meshgrid(s,t);
>> z=0;
>> for i=1:ell, z=z+0.5*(Xbar(i,1).*w1+Xbar(i,2).*w2-ybar(i)).^2; end
>> contour(s,t,z,40)
>> xlabel('{w_1}'),ylabel('{w_2}'),title('weight versus loss function (L)')

[Figure: contour plot of the loss L over the (w_1, w_2) plane.]

32
>> eta=1;                             % Try first eta = 1. Note that the direction
                                      % of steepest descent is always orthogonal
                                      % to the contour lines
>> w=[3 3]'
w =
     3
     3
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
   -70
     4
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
   1.0e+003 *
    2.3204
   -0.7844
>> eta=0.1;                           % Try again with eta = 0.1.
>> w=[3 3]'
w =
     3
     3
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
   -4.3000
    3.1000
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
   13.0340
   -4.6940
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
  -36.1359
   18.0447
>> eta=0.01;                          % Try again with eta = 0.01.

33
>> w=[3 3]'
w =
     3
     3
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
    2.2700
    3.0100
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
    1.7863
    2.9411
>> w = w - eta*Xbar'*(Xbar*w-ybar)    % after 100 more iterations we finally get the
                                      % right answer to 5 significant digits:
w =
    0.2603
    1.2025
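Rather than repeating the update by hand, the iteration can of course be wrapped in a loop; a sketch (the iteration limit and tolerance are arbitrary choices):

    eta = 0.01;  w = [3 3]';
    for iter = 1:1000
        w_new = w - eta*Xbar'*(Xbar*w - ybar);   % one steepest descent step
        if norm(w_new - w) < 1e-8, w = w_new; break, end
        w = w_new;
    end
    w                                             % approaches [0.2603; 1.2025]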

34 Ridge regression

Here we aim to strike a balance between fitting the input data to the output and keeping the absolute values of w and b small, by determining w and b such that

    [ X̂ ; µI ] ŵ ≈ [ y ; 0 ],

where I is the (n+1) × (n+1) identity matrix and 0 is an (n+1)-vector of zeros. This is equivalent to setting

    ŵ = ( X̂'X̂ + λI )^{-1} X̂' y,

where λ = µ².

35
>> mu=1
mu =
     1
>> Xridge=[Xhat;mu*eye(n+1)]
>> yridge=[y;zeros(n+1,1)]
>> wb1=Xridge\yridge

36
>> error=yridge-Xridge*wb1
>> mu=10                              % Put greater weight on making ŵ small
mu =
    10
>> Xridge=[Xhat;mu*eye(n+1)]

37
>> wb10 = Xridge\yridge
wb10 =
    0.2689
    0.4368
    0.0608
>> lambda=mu^2                        % Here we calculate ŵ using the equivalent
                                      % alternative formulation
lambda =
   100
>> wb10 = (Xhat'*Xhat + lambda*eye(n+1))\(Xhat'*y)
wb10 =
    0.2689
    0.4368
    0.0608

38 Dual ridge regression

Here we try the dual approach to ridge regression described on pages 23 and 24 in the textbook. The central idea is that we seek a formulation based on the l × l Gram matrix Ĝ = X̂X̂' rather than the (n+1) × (n+1) covariance matrix X̂'X̂, but note that the Gram matrix will be singular if l > n + 1. Also note that the derivation in the book holds if we work with X̂ rather than X, which we have to do in order to determine b as well as w.

In particular, if

    (λI_l + Ĝ) α̂ = y,    where ŵ = X̂'α̂,

then

    Ĝ (λI_l + Ĝ) α̂ = Ĝy

and hence

    X̂ ( λ X̂'α̂ + X̂'X̂ X̂'α̂ ) = X̂ X̂'y,

which in turn implies that

    ( λI_{n+1} + X̂'X̂ ) ŵ = X̂'y,

i.e. the equation for primal ridge regression, provided the columns of X̂ are linearly independent.

39
>> lambda=100;                        % i.e. 10^2 as above
>> G=Xhat*Xhat'
>> alpha=(lambda*eye(ell)+G)\y
alpha =
    0.0208
    0.0124
    0.0097
    0.0200
    0.0060
    0.0068
    0.0041
   -0.0098
   -0.0069
   -0.0023
>> wb10=Xhat'*alpha                   % Note that these are the same coefficients
                                      % as we get with the ordinary ridge regression above.
wb10 =
    0.2689
    0.4368
    0.0608
>> lambda=0.0001                      % By decreasing lambda we get closer to the
                                      % original regression values
lambda =
  1.0000e-004

40
>> alpha=(lambda*eye(ell)+G)\y
alpha =
  1.0e+003 *
>> wb001=Xhat'*alpha
>> alpha=G\y                          % The results become, however, meaningless when we
                                      % set lambda=0 because G is singular
Warning: Matrix is close to singular or badly scaled.
         Results may be inaccurate.
alpha =
  1.0e+017 *
>> wb=Xhat'*alpha

41 Regression, classification and discriminant analysis

We can use linear regression to obtain a separator (w, b) for the dataset (x_1, y_1), ..., (x_l, y_l), where y_i ∈ {−1, +1}, simply by solving the system X̂ŵ ≈ y. Note that in the normal equations

    X̂'X̂ ŵ = X̂'y

the right-hand side is now the vector (of length n + 1)

    [ (l_{+1} m_{x,+1} − l_{−1} m_{x,−1})   (l_{+1} − l_{−1}) ]',

where l_{+1} and l_{−1} are the numbers of data points in groups +1 and −1 respectively, and m_{x,+1} and m_{x,−1} are the (row) vectors of average attribute values within these groups.¹

If l_{+1} = l_{−1} we are effectively choosing w in such a way that we maximize the difference between the mean value of ⟨w · x⟩ within group +1 on the one hand and group −1 on the other, while restricting the variance in these values to be constant.

¹ For example: say there are 5 examples in each class, located in the top and bottom halves of the data matrix; then the r.h.s. vector X̂'y would be represented in MATLAB like this: 5*mean(Xhat(1:5,:)) - 5*mean(Xhat(6:10,:))

42 This is the separation criterion used in so-called discriminant analysis in statistics (cf. p. 19 in the book). The separator obtained in this way is, however, not necessarily a maximum margin separator.

>> Ip = 1:5                           % indices for group +1
>> Im = 6:10                          % indices for group -1
>> y(Ip) = 1
>> y(Im) = -1
>> wb = Xhat\y
>> Xhat*wb
>> % clearly not a max-min margin separator! Why?

43 1-norm regression

When solving an overdetermined system X̂ŵ ≈ y so that ||X̂ŵ − y|| is minimized, we do not necessarily want to define the length of a vector as ||z||_2 = (Σ_i z_i²)^{1/2}, called the two-norm. An alternative is ||z||_1 = Σ_i |z_i|, called the one-norm. With such a definition we wish to determine ŵ in such a way that

    Σ_{i=1}^l | ⟨w · x_i⟩ + b − y_i |

is minimized.

In order to obtain the 1-norm solution we can formulate the problem as the following linear programming problem:

    min  Σ_{i=1}^l ( ξ_i^+ + ξ_i^− )
    subject to  X̂ŵ + ξ^+ − ξ^− = y,   ξ^+ ≥ 0,   ξ^− ≥ 0.

44 Note that

    ξ_i^+ = y_i − ⟨ŵ · x̂_i⟩   if this error is > 0, and 0 otherwise,
    ξ_i^− = ⟨ŵ · x̂_i⟩ − y_i   if the error y_i − ⟨ŵ · x̂_i⟩ is < 0, and 0 otherwise.

This problem is readily solved by the MATLAB function

    X = linprog(f,A,b,Aeq,beq,low)

which solves the problem

    min f'x   s.t.   Ax ≤ b,   Aeq·x = beq,   x ≥ low

(note that the number of elements in f, x and low and the number of columns in A and Aeq must all be the same). We introduce the following (n + 1 + 2l)-vectors:

    x   = [ ŵ'  ξ^+'  ξ^−' ]'
    f   = [ zeros(1, n+1)  ones(1, 2l) ]
    low = [ -inf*ones(1, n+1)  zeros(1, 2l) ]

and set

    A = zeros(1, n+1+2l),   b = 0,   Aeq = [ X̂  I_l  −I_l ],   beq = y.

45
>> y=[6 5 5 7 5 4 4 2 3 4]';          % the original 10 output values again
>> f=[zeros(1,n+1) ones(1,2*ell)]
>> low=[-inf -inf -inf zeros(1,2*ell)]
>> Aeq=[Xhat eye(ell) -eye(ell)]

46
>> sol=linprog(f,zeros(1,n+1+2*ell),0,Aeq,y,low')
Optimization terminated successfully.

% The first two elements of the solution are the w-values
% The third element is the b-value
% The next 10 values are the positive differences between output values and
% calculated values
% The last 10 values are the negative differences between output values and
% calculated values

47
>> error = y-Xhat*sol(1:3)
% note how the error relates to the solution.

48 Other possibilities

Another possibility would be to determine ŵ in such a way that

    max_{i ∈ {1,...,l}} | ⟨w · x_i⟩ + b − y_i |

is minimized; this will be referred to as regression in the ∞-norm. In support-vector regression we encounter yet another possibility for determining ŵ.
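The ∞-norm problem can also be cast as a linear program in the same spirit as the 1-norm formulation above: minimize a single bound t on all the absolute errors. A sketch continuing the example (the variable ordering x = [w; b; t] and the use of inequality constraints are choices made here for illustration):

    % minimize t subject to  -t <= <w . x_i> + b - y_i <= t  for all i
    f   = [zeros(n+1,1); 1];                          % objective: minimize t
    A   = [ Xhat -ones(ell,1); -Xhat -ones(ell,1) ];  % Xhat*wb - y <= t  and  y - Xhat*wb <= t
    bb  = [ y; -y ];
    low = [ -inf*ones(n+1,1); 0 ];                    % wb unconstrained, t >= 0
    sol = linprog(f, A, bb, [], [], low);
    wb_inf = sol(1:n+1);                              % the infinity-norm regression coefficients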
