Linear Learning Machines

1 Linear Learning Machines
Chapter 2, February 13, 2003
T.P. Runarsson (tpr@hi.is) and S. Sigurdsson (sven@hi.is)

2 Linear learning machines

In supervised learning, the learning machine is given a training set of examples (inputs) with associated labels (output values),

    S = ((x_1, y_1), (x_2, y_2), ..., (x_l, y_l)) ⊆ (X × Y)^l

(l denotes the number of training samples, the x_i are examples or instances, and the y_i their labels). A training set S is said to be trivial if all labels are equal.

Usually the input space is a subset of real vector space, X ⊆ R^n (n is the dimension of the input space). An input x = (x_1, x_2, ..., x_n)' is a vector of length n (' denotes matrix transposition).

Linear functions are probably the best understood and simplest hypotheses. A learning machine using a hypothesis that forms linear combinations of the input variables is known as a linear learning machine.

3 Linear classification

A linear function f(x) is frequently used for binary classification, y ∈ {−1, +1}, as follows: assign x = (x_1, x_2, ..., x_n)' to the +ve class if f(x) ≥ 0, otherwise assign it to the −ve class, where

    f(x) = Σ_{i=1}^n w_i x_i + b = ⟨w · x⟩ + b

(⟨ · ⟩ denotes the inner product).

[Figure: a separating hyperplane (w, b) ∈ R^n × R for a 2D training set, with its normal vector w and bias b.]

4 Linear classification - geometric interpretation

[Figure: the hyperplane (dark line), its normal vector w, the bias b and a point x_i.]

The vector w defines a direction perpendicular to the hyperplane (the dark line). The value of b moves the hyperplane parallel to itself; this value is sometimes called the bias (or threshold) and is necessary if all hyperplanes in R^n are to be represented (the number of free parameters is then n + 1).

Recall that the perpendicular Euclidean distance from a point x_i to the hyperplane is

    (Σ_{j=1}^n w_j x_{ij} + b) / ||w|| = (⟨w · x_i⟩ + b) / ||w||,

where ||w|| = sqrt(⟨w · w⟩).
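The decision rule and the distance formula are easy to check numerically. A minimal MATLAB sketch (with made-up data, weights and bias, purely for illustration):

    X = [0 0; 1 0; 0 1; 1 1];     % one example per row
    w = [1; 1];                    % hypothetical weight vector
    b = -0.5;                      % hypothetical bias
    f = X*w + b;                   % f(x_i) = <w . x_i> + b for every example
    yhat = sign(f);                % predicted classes
    yhat(f == 0) = 1;              % f(x) = 0 is assigned to the +ve class by convention
    dist = abs(f) / norm(w);       % perpendicular Euclidean distance to the hyperplane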

5 Rosenblatt's Perceptron

Both statisticians and neural network researchers have used linear classifiers: the theory of linear discriminants by Fisher in 1936, and then the perceptron, introduced by Rosenblatt in the late 1950s. Rosenblatt's algorithm was the first iterative procedure for learning linear classifications. The algorithm is:
- on-line: it starts with an initial connection weight vector w = 0 (the all-zero vector),
- mistake driven: it only adapts the weights when a classification mistake is made.

The algorithm converges when all the training data are correctly classified; this requires the data to be linearly separable, and convergence occurs in finite time.

6 The neuron and perceptron analogy

[Figure: a biological neuron (dendrites, nucleus, cell body, axon, synapses to and from other neurons) alongside the perceptron: input nodes x_1, ..., x_n with connection weights w_1, ..., w_n feeding an output node that computes f(⟨w · x⟩ + b).]

(The threshold of the output node would be set to 0; an extra input with a constant value of 1 is added and the corresponding weight is the bias b.)

7 Linear separability

In the 1960s Rosenblatt's work (plus some hype) spurred a huge growth of research and corresponding financial investment in this area. In 1969, however, Minsky and Papert published a book titled Perceptrons (they were working in symbol-processing AI, a competitor to the perceptron approach). The book presented a condemning look at perceptrons, and as a result funding was blocked for more than 10 years! The field was revived by Hopfield (1982) and Rumelhart & McClelland (1986).

[Figure: the AND and XOR problems; AND is linearly separable, XOR is not.]

8 The perceptron algorithm (primal form)

Given a linearly separable training set S and a learning rate η ∈ R+ (y ∈ {−1, +1}):

    w_0 ← 0, b_0 ← 0, k ← 0, R ← max_{1≤i≤l} ||x_i||
    repeat
        for i = 1 to l
            if y_i (⟨w_k · x_i⟩ + b_k) ≤ 0 then
                w_{k+1} ← w_k + η y_i x_i
                b_{k+1} ← b_k + η y_i R²
                k ← k + 1
            end if
        end for
    until no mistake is made within the for loop
    return (w_k, b_k), where k is the number of mistakes.

Note that the contribution of example i to the weight change is α_i η y_i x_i, where α_i is the number of times x_i has been misclassified (i.e. k = Σ_{i=1}^l α_i). Therefore we may write

    w = Σ_{i=1}^l α_i y_i x_i

assuming the initial weights are zero. The learning rate η only changes the scaling of the hyperplane and so is not really needed.
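The pseudocode translates almost line for line into MATLAB. A sketch, assuming linearly separable data as above (the function name perceptron_primal is a hypothetical helper, not something defined in these notes):

    function [w, b, k] = perceptron_primal(X, y, eta)
    % Primal perceptron: X is l x n (one example per row), y is l x 1 with
    % entries +1/-1, eta is the learning rate.
    [l, n] = size(X);
    w = zeros(n, 1); b = 0; k = 0;
    R = max(sqrt(sum(X.^2, 2)));         % R = max_i ||x_i||
    mistakes = true;
    while mistakes                        % repeat ... until no mistake is made
        mistakes = false;
        for i = 1:l
            if y(i)*(X(i,:)*w + b) <= 0   % classification mistake on example i
                w = w + eta*y(i)*X(i,:)';
                b = b + eta*y(i)*R^2;
                k = k + 1;
                mistakes = true;
            end
        end
    end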

9 Perceptron (primal) numerical example

The following are the first few steps of the perceptron algorithm (primal form) for the OR problem:

    sample_id   x(1)   x(2)    y
        1        0      0     -1
        2        1      0     +1
        3        0      1     +1
        4        1      1     +1

We start with w^(0) = 0, b^(0) = 0, k = 0 and examine each example in turn. For the first sample x = (x(1), x(2)) = (0, 0) and y = −1:

    f(x) = w^(0)(1) x(1) + w^(0)(2) x(2) + b^(0) = 0·0 + 0·0 + 0 = 0

but it should be −ve because y = −1 for x = (0, 0), and so we must do an update:

    w^(k+1)(1) = w^(k)(1) + η y x(1) = 0 + 1·(−1)·0 = 0

(here we have chosen η = 1). Similarly:

    w^(k+1)(2) = w^(k)(2) + η y x(2) = 0 + 1·(−1)·0 = 0

and finally we update the bias:

    b^(k+1) = b^(k) + η y R² = 0 + 1·(−1)·2 = −2

Here R is the largest norm of the input vectors: ||x_1|| = 0, ||x_2|| = 1, ||x_3|| = 1, ||x_4|| = √2, that is R = √2 (so R² = 2). Finally we update our counter, k = k + 1.

10 Now we still have w^(1) = 0 but b^(1) = −2. Let's examine the second sample, x = (x(1), x(2)) = (1, 0) and y = +1:

    f(x) = w^(1)(1) x(1) + w^(1)(2) x(2) + b^(1) = 0·1 + 0·0 − 2 = −2

but it should be +ve because y = +1 for x = (1, 0), and so we must do an update:

    w^(k+1)(1) = w^(k)(1) + η y x(1) = 0 + 1·1·1 = 1
    w^(k+1)(2) = w^(k)(2) + η y x(2) = 0 + 1·1·0 = 0

and finally the bias:

    b^(k+1) = b^(k) + η y R² = −2 + 1·1·2 = 0

and we update our counter k = k + 1 again. Continue looping through the examples until no more classification mistakes are made; a complete run is sketched below.
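With the hypothetical perceptron_primal function sketched above, the complete run on the OR data is a one-liner (again only an illustration of the loop just worked through by hand):

    X = [0 0; 1 0; 0 1; 1 1];                % the OR inputs, one example per row
    y = [-1; 1; 1; 1];                        % the OR labels
    [w, b, k] = perceptron_primal(X, y, 1);   % eta = 1 as in the worked example
    % (w, b) is a separating hyperplane for the OR problem and k is the
    % total number of updates (mistakes) that were made.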

11 Multi-class perceptron

When we have m different classes, simply create m perceptrons! For example, say there are m = 3 classes and two inputs (n = 2), and we are given a training set S of the four points [0, 0], [1, 0], [0, 1], [1, 1], each labelled with one of the three classes.

The target outputs for the three different perceptrons are then: the first perceptron (w_1, b_1) is used to distinguish all inputs belonging to the first class, so its target output is +1 for the examples of that class and −1 for all the others; likewise the second (w_2, b_2) for the second class and the last (w_3, b_3) for the third class. I.e. either the input belongs to a particular class (+1) or not (−1).

The perceptron algorithm is the same as before, but now we simply have three problems to solve!

12 The results for the above example could look something like this:

[Figure: the three perceptron decision boundaries in the (x_1, x_2) plane; the regions between them are hatched.]

In the outer hatched regions we have points belonging to two classes! In the inner hatched region the points are in no class. In general we may resolve the hatched regions as follows:

    c(x) = arg max_{1≤i≤m} ( ⟨w_i · x⟩ + b_i )

i.e. assigning x to the class whose hyperplane is furthest from it.
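Once the m weight vectors are stacked as the columns of a matrix W and the biases collected in a vector b, the arg-max rule is a one-liner in MATLAB; a small sketch with made-up weights:

    W = [ 1 -1  0 ;                % hypothetical weight vectors, one per class (column)
          0  1 -1 ];
    b = [-0.5 -0.5 0.5];           % hypothetical biases, one per class
    x = [1; 0];                    % a test input
    scores = x'*W + b;             % <w_i . x> + b_i for i = 1..m
    [~, c] = max(scores);          % c(x) = arg max_i ( <w_i . x> + b_i )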

13 The perceptron algorithm (dual form)

Given a linearly separable training set S (y ∈ {−1, +1}):

    R ← max_{1≤i≤l} ||x_i||, α ← 0, b ← 0
    repeat
        for i = 1 to l
            if y_i ( Σ_{j=1}^l α_j y_j ⟨x_j · x_i⟩ + b ) ≤ 0 then
                α_i ← α_i + 1
                b ← b + η y_i R²
            end if
        end for
    until no mistake is made within the for loop
    return (α, b) to define h(x):

    h(x) = sgn(⟨w · x⟩ + b) = sgn( Σ_{j=1}^l α_j y_j ⟨x_j · x⟩ + b )

The parameter α_i is referred to as the embedding strength:
- an example i with few/many mistakes has a small/large α_i,
- for non-separable data the α_i of misclassified points keep on growing,
- α_i can be regarded as the information content of x_i.
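A MATLAB sketch of the dual form (the function name perceptron_dual is a hypothetical helper; as before, termination assumes linearly separable data) makes it explicit that the examples enter only through the Gram matrix G:

    function [alpha, b] = perceptron_dual(X, y, eta)
    % Dual perceptron: X is l x n (one example per row), y is l x 1 (+1/-1).
    l = size(X, 1);
    G = X*X';                            % Gram matrix, G(i,j) = <x_i . x_j>
    R = max(sqrt(diag(G)));              % R = max_i ||x_i||
    alpha = zeros(l, 1); b = 0;
    mistakes = true;
    while mistakes
        mistakes = false;
        for i = 1:l
            if y(i)*(sum(alpha.*y.*G(:,i)) + b) <= 0
                alpha(i) = alpha(i) + 1;
                b = b + eta*y(i)*R^2;
                mistakes = true;
            end
        end
    end
    % A new point x is then classified as h(x) = sign(sum(alpha.*y.*(X*x)) + b).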

14 The Gram matrix

Given a set {x_1, ..., x_l} of vectors from an inner product space X, the l × l matrix G with entries G_ij = ⟨x_i · x_j⟩ is called the Gram matrix. The book sometimes uses the following notation:

    G = ( ⟨x_i · x_j⟩ )_{i,j=1}^l

An important observation here is that the input data enter the algorithm only through the entries of the Gram matrix!

15 Margins of a hyperplane

The (functional) margin of an example (x_i, y_i) with respect to a hyperplane (w, b) is the quantity (y ∈ {−1, +1})

    γ_i = y_i (⟨w · x_i⟩ + b).

γ_i > 0 implies a correct classification of (x_i, y_i).

The margin distribution of a hyperplane (w, b) w.r.t. a training set S is the distribution of the margins of the examples in S. The minimum of the margin distribution is the (functional) margin of the hyperplane (w, b) with respect to the training set S.

The geometric margin is the perpendicular Euclidean distance of the point to the hyperplane (see page 3), i.e. γ_i / ||w||.

The margin of a training set S is the maximum geometric margin over all hyperplanes. A hyperplane realizing this maximum is known as a maximal margin hyperplane.
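For whatever hyperplane (w, b) is at hand (e.g. the output of the perceptron above), these quantities are computed directly; a short sketch assuming X, y, w and b are already in the workspace:

    gamma_f  = y .* (X*w + b);      % functional margins of all examples
    margin_f = min(gamma_f);        % functional margin of (w, b) w.r.t. S
    gamma_g  = gamma_f / norm(w);   % geometric margins
    margin_g = min(gamma_g);        % geometric margin of (w, b) w.r.t. S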

16 The margin and maximal margin

The main points:
- The margin of S is the smallest γ_i / ||w|| over the examples in S w.r.t. the hyperplane (w, b).
- Now try to find some other hyperplane (w_opt, b_opt) for which this margin is largest. This will be the maximal geometric margin, and the corresponding hyperplane the maximal margin hyperplane.

[Figure: a 2D training set with its maximal margin hyperplane and margin γ.]

17 Convergence of the perceptron in finite time

Theorem [Novikoff]. Let S be a non-trivial training set, and let

    R = max_{1≤i≤l} ||x_i||.

Suppose that there exists a vector w_opt such that ||w_opt|| = 1 and

    y_i (⟨w_opt · x_i⟩ + b_opt) ≥ γ    for 1 ≤ i ≤ l.

Then the number of mistakes made by the on-line perceptron algorithm on S is at most

    (2R / γ)².

Note that for the dual form we can now bound α by the number of mistakes made, i.e.

    ||α||_1 ≤ (2R / γ)²    ( ||α||_1 = Σ_{i=1}^l α_i ).
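The bound is easy to evaluate for any candidate hyperplane; a sketch assuming X, y and some separating (w_opt, b_opt) are already in the workspace:

    s = norm(w_opt);
    w_opt = w_opt/s;  b_opt = b_opt/s;       % rescale so that ||w_opt|| = 1
    R = max(sqrt(sum(X.^2, 2)));              % R = max_i ||x_i||
    gamma = min(y .* (X*w_opt + b_opt));      % margin of the rescaled hyperplane (must be > 0)
    bound = (2*R/gamma)^2;                    % Novikoff bound on the number of mistakes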

18 The margin slack variable

The margin slack variable measures the amount of non-separability of the sample.

[Figure: a hyperplane with target margin γ; two points x_i and x_j fail to meet the margin, with slacks ξ_i and ξ_j.]

19 Convergence and non-separable data

Formally, the new quantity, the margin slack variable of an example (x_i, y_i) with respect to the hyperplane (w, b) and target margin γ, is defined as

    ξ_i = max(0, γ − y_i (⟨w · x_i⟩ + b)).

If ξ_i > γ, then x_i is misclassified by (w, b). The norm D = ||ξ|| takes into account any misclassification of the training data.

The number of mistakes in the first execution of the for loop of the perceptron algorithm on S is bounded by

    ( 2(R + D) / γ )²

(see the Freund and Schapire theorem).
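Again this is straightforward to evaluate numerically; a sketch assuming X, y, a hyperplane (w, b) and a target margin gamma are in the workspace:

    xi = max(0, gamma - y.*(X*w + b));   % margin slack variables
    D  = norm(xi);                        % D = ||xi||
    R  = max(sqrt(sum(X.^2, 2)));
    bound = (2*(R + D)/gamma)^2;          % mistake bound for the first pass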

20 Linear regression

In linear regression we associate each data point x ∈ R^n with a distinct output value y ∈ R and aim to construct a linear function f : R^n → R,

    f(x) = ⟨w · x⟩ + b,

where w ∈ R^n and b ∈ R, such that f(x) ≈ y for each data vector. In particular, in the training set S = ((x_1, y_1), ..., (x_l, y_l)) all the y-values may be different, and we determine from this set w and b in such a way that

    L(w, b) = Σ_{i=1}^l ( ⟨w · x_i⟩ + b − y_i )²

is as small as possible.

Introduce an l × n matrix X with l rows and n columns so that the i-th row of X is the i-th data vector x_i'. Denote the j-th column of X by x^j; this is the j-th attribute vector, containing the j-th attribute value of all the l data vectors. Thus

    X = [ x_1'; x_2'; ... ; x_l' ] = [ x^1  x^2  ...  x^n ].

21 Let e denote the l-vector [1 1 ... 1]' and X̂ the l × (n + 1) matrix obtained by augmenting X with the vector e, i.e. X̂ = [X e]. Let ŵ = [w' b]' and y = [y_1 y_2 ... y_l]'. Then we can say that we are trying to determine ŵ in such a way that X̂ŵ ≈ y, in the sense that

    L(ŵ) = ||X̂ŵ − y||²

is as small as possible (recall ||z|| = ⟨z · z⟩^{1/2}).

Assuming that l > n + 1, this is called an overdetermined linear system, and the solution satisfying the above criterion is called a least squares solution. Geometrically this can be interpreted in two ways:
- In the (n + 1)-dimensional data space we determine an n-dimensional plane z = ⟨w · x⟩ + b so that the sum of squares of the vertical distances of the points (x_i, y_i), i = 1, 2, ..., l, from the plane is minimized.
- In the l-dimensional attribute space we determine a linear combination of the attribute vectors and e, u = w_1 x^1 + ... + w_n x^n + b e, which is as close as possible to the output vector y in the sense that ||u − y|| is minimized.

22 From the second point of view, u is chosen in such a way that it is the projection of y onto the plane spanned by x^1, ..., x^n and e. If we do that, u − y will be orthogonal to that plane, and in particular to the vectors x^1, ..., x^n and e. This implies that

    X̂' (X̂ŵ − y) = 0,

which is equivalent to the so-called normal equations:

    X̂'X̂ ŵ = X̂' y

that we can solve to determine ŵ, provided all the columns of X̂ are linearly independent so that the (n + 1) × (n + 1) matrix X̂'X̂ is nonsingular.

There are numerically more stable methods to determine ŵ from X̂ and y, based on so-called QR-factorization of X̂, that are e.g. used in MATLAB when an overdetermined system is solved directly.

The normal equations can also be derived from the condition that

    ∇L(ŵ) = ∂L/∂ŵ = 2 X̂' (X̂ŵ − y) = 0

for the ŵ that minimizes L.

23 A linear regression example with 10 data points and 2 attributes

>> X=[3 7;4 6;5 6;7 7;8 5;4 5;5 5;6 3;7 4;9 4]   % the ell x n (10x2) data matrix
>> plot(X(1:5,1),X(1:5,2),'o',X(6:10,1),X(6:10,2),'*')
>> xlabel('x1'),ylabel('x2'),title('input data'),axis([ ])

[Figure: scatter plot of the ten input points in the (x1, x2) plane, titled 'input data'.]

24
>> y=[6 5 5 7 5 4 4 2 3 4]'          % The 10-output vector
>> ell=size(X,1)                      % The number of rows in the matrix
ell =
    10
>> Xhat=[X ones(ell,1)]               % The data matrix augmented with a column of
                                      % 1-elements; ones(m,n) is a matrix of one-elements
                                      % with m rows and n columns
>> wb=Xhat\y                          % An overdetermined system can be solved "directly"
wb =
    0.2603
    1.2025
   -3.2630

25
>> wb=(Xhat'*Xhat)\(Xhat'*y)          % or from the normal equations
wb =
    0.2603
    1.2025
   -3.2630
>> error=y-Xhat*wb                    % The difference between given output values and
                                      % "calculated" values that we are trying to minimize
error =
    0.0644
    0.0066
   -0.2538
    0.0231
    0.1678
    0.2091
   -0.0512
    0.0935
   -0.3694
    0.1100

26 Relation between regression and covariances

Let m_x denote the vector of average values of the x-values in each column of X, i.e. the average attribute value associated with each column. Calculate from the data matrix X the matrix X̄ by subtracting from each column this average value. Similarly, let m_y denote the average output value and calculate from the output vector y the vector ȳ by subtracting from each output value this average output value.

Then it is equivalent to solve X̄w ≈ ȳ, i.e.

    w = (X̄'X̄)^{-1} X̄'ȳ,

and then set b = m_y − ⟨m_x · w⟩, as to solve X̂ŵ ≈ y, i.e.

    ŵ = (X̂'X̂)^{-1} X̂'y.

This follows from the fact that if we try to add a bias to decrease the least squares error in the former (centred) case, this bias will always be zero, and that if y − m_y = ⟨w · (x − m_x)⟩ then y = ⟨w · x⟩ + (m_y − ⟨w · m_x⟩).

27 The matrix (1/l) X̄'X̄ is the covariance matrix of the attributes over the dataset, from which we can calculate correlation coefficients between attributes. These coefficients may be thought of as the cosines of the angles between the corresponding (centred) attribute vectors in l-dimensional space, reflecting how close they are to each other, where the cosine of the angle between two vectors x and y is defined as

    ⟨x · y⟩ / ( ||x|| ||y|| ).

Similarly, (1/l) X̄'ȳ is a covariance vector between the output and the input attributes.

28 The example continued...

>> mx=mean(X)                         % calculate the vector of mean values of the
                                      % values in each column of X
mx =
    5.8000    5.2000
>> n=size(X,2)
n =
     2
>> Xbar=X-ones(ell,n)*diag(mx)        % diag(x) is a diagonal matrix with the elements
                                      % of the vector x along the diagonal
>> my=mean(y)
my =
    4.5000
>> ybar=y-my*ones(ell,1)
ybar =
    1.5000
    0.5000
    0.5000
    2.5000
    0.5000
   -0.5000
   -0.5000
   -2.5000
   -1.5000
   -0.5000

29
>> w=Xbar\ybar
w =
    0.2603
    1.2025
>> b=my-mx*w
b =
   -3.2630
>> covx=(1/ell)*Xbar'*Xbar
covx =
    3.3600   -1.0600
   -1.0600    1.5600
>> cor12=covx(1,2)/sqrt(covx(1,1)*covx(2,2))
cor12 =
   -0.4630
>> covxy=(1/ell)*Xbar'*ybar
covxy =
   -0.4000
    1.6000
>> vary=(1/ell)*ybar'*ybar
vary =
    1.8500
>> cor1y=covxy(1)/sqrt(covx(1,1)*vary)
cor1y =
   -0.1604
>> cor2y=covxy(2)/sqrt(covx(2,2)*vary)
cor2y =
    0.9418

30 The steepest descent algorithm of Widrow-Hoff

We are trying to find the ŵ-vector that minimizes the function

    L(ŵ) = ½ (y − X̂ŵ)' (y − X̂ŵ).

The negative gradient vector

    −∇L(ŵ) = −∂L/∂ŵ = X̂' (y − X̂ŵ)

points in the direction of steepest descent. The steepest descent algorithm is an iterative algorithm based on the idea of updating ŵ by moving a suitable distance in the direction of steepest descent at each iteration, i.e.

    ŵ_new = ŵ_old + η X̂' (y − X̂ŵ_old).

Since

    X̂' (y − X̂ŵ) = Σ_{i=1}^l ( y_i − ⟨x̂_i · ŵ⟩ ) x̂_i,

the inner loop of the algorithm in Table 2.3 on p. 23 corresponds to one step of this algorithm, but they are not fully identical because in Table 2.3 we update ŵ for each i in the inner loop, so

31 when calculating y_i − ⟨x̂_i · ŵ⟩, ŵ will contain both new and old values.

We apply the steepest descent algorithm below to

    L(w) = ½ (ȳ − X̄w)' (ȳ − X̄w)

with the example above. First we show the contour lines of L:

>> s=-5:0.1:5;
>> t=s;
>> [w1,w2]=meshgrid(s,t);
>> z=0;
>> for i=1:ell, z=z+0.5*(Xbar(i,1).*w1+Xbar(i,2).*w2-ybar(i)).^2; end
>> contour(s,t,z,40)
>> xlabel('{w_1}'),ylabel('{w_2}'),title('weight versus loss function (L)')

[Figure: contour plot of the loss L over the (w_1, w_2) plane.]

32
>> eta=1;                             % Try first eta = 1. Note that the direction
                                      % of steepest descent is always orthogonal
                                      % to the contour lines
>> w=[3 3]'
w =
     3
     3
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
   -70
     4
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
   1.0e+003 *
    2.3204
   -0.7844
>> eta=0.1;                           % Try again with eta = 0.1.
>> w=[3 3]'
w =
     3
     3
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
   -4.3000
    3.1000
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
   13.0340
   -4.6940
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
  -36.1359
   18.0447
>> eta=0.01;                          % Try again with eta = 0.01.

33
>> w=[3 3]'
w =
     3
     3
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
    2.2700
    3.0100
>> w = w - eta*Xbar'*(Xbar*w-ybar)
w =
    1.7863
    2.9411
>> w = w - eta*Xbar'*(Xbar*w-ybar)    % after 100 more iterations we finally get the
                                      % right answer to 5 significant digits:
w =
    0.2603
    1.2025
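Rather than repeating the update by hand, the iteration can of course be wrapped in a loop; a sketch (the iteration limit and tolerance are arbitrary choices):

    eta = 0.01;  w = [3 3]';
    for iter = 1:1000
        w_new = w - eta*Xbar'*(Xbar*w - ybar);   % one steepest descent step
        if norm(w_new - w) < 1e-8, w = w_new; break, end
        w = w_new;
    end
    w                                             % approaches [0.2603; 1.2025]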

34 Ridge regression

Here we aim to strike a balance between fitting the input data to the output and keeping the absolute values of w and b small, by determining w and b such that

    [ X̂ ; µI ] ŵ ≈ [ y ; 0 ],

where I is the (n+1) × (n+1) identity matrix and 0 is an (n+1)-vector of zeros. This is equivalent to setting

    ŵ = ( X̂'X̂ + λI )^{-1} X̂' y,

where λ = µ².

35
>> mu=1
mu =
     1
>> Xridge=[Xhat;mu*eye(n+1)]
>> yridge=[y;zeros(n+1,1)]
>> wb1=Xridge\yridge

36
>> error=yridge-Xridge*wb1
>> mu=10                              % Put greater weight on making ŵ small
mu =
    10
>> Xridge=[Xhat;mu*eye(n+1)]

37
>> wb10 = Xridge\yridge
wb10 =
    0.2689
    0.4368
    0.0608
>> lambda=mu^2                        % Here we calculate ŵ using the equivalent
                                      % alternative formulation
lambda =
   100
>> wb10 = (Xhat'*Xhat + lambda*eye(n+1))\(Xhat'*y)
wb10 =
    0.2689
    0.4368
    0.0608

38 Dual ridge regression

Here we try the dual approach to ridge regression described on pages 23 and 24 in the textbook. The central idea is that we seek a formulation based on the l × l Gram matrix Ĝ = X̂X̂' rather than the (n+1) × (n+1) covariance matrix X̂'X̂, but note that the Gram matrix will be singular if l > n + 1. Also note that the derivation in the book holds if we work with X̂ rather than X, which we have to do in order to determine b as well as w.

In particular, if

    (λI_l + Ĝ) α̂ = y,    where ŵ = X̂'α̂,

then

    Ĝ (λI_l + Ĝ) α̂ = Ĝy

and hence

    X̂ ( λ X̂'α̂ + X̂'X̂ X̂'α̂ ) = X̂ X̂'y,

which in turn implies that

    ( λI_{n+1} + X̂'X̂ ) ŵ = X̂'y,

i.e. the equation for primal ridge regression, provided the columns of X̂ are linearly independent.

39
>> lambda=100;                        % i.e. 10^2 as above
>> G=Xhat*Xhat'
>> alpha=(lambda*eye(ell)+G)\y
alpha =
    0.0208
    0.0124
    0.0097
    0.0200
    0.0060
    0.0068
    0.0041
   -0.0098
   -0.0069
   -0.0023
>> wb10=Xhat'*alpha                   % Note that these are the same coefficients
                                      % as we get with the ordinary ridge regression above.
wb10 =
    0.2689
    0.4368
    0.0608
>> lambda=0.0001                      % By decreasing lambda we get closer to the
                                      % original regression values
lambda =
  1.0000e-004

40
>> alpha=(lambda*eye(ell)+G)\y
alpha =
  1.0e+003 *
>> wb001=Xhat'*alpha
>> alpha=G\y                          % The results become, however, meaningless when we
                                      % set lambda=0 because G is singular
Warning: Matrix is close to singular or badly scaled.
         Results may be inaccurate.
alpha =
  1.0e+017 *
>> wb=Xhat'*alpha

41 Regression, classification and discriminant analysis

We can use linear regression to obtain a separator (w, b) for the dataset (x_1, y_1), ..., (x_l, y_l), where y_i ∈ {−1, +1}, simply by solving the system X̂ŵ ≈ y. Note that in the normal equations

    X̂'X̂ ŵ = X̂'y

the right-hand side is now the vector (of length n + 1)

    [ (l_{+1} m_{x,+1} − l_{−1} m_{x,−1})   (l_{+1} − l_{−1}) ]',

where l_{+1} and l_{−1} are the numbers of data points in groups +1 and −1 respectively, and m_{x,+1} and m_{x,−1} are the (row) vectors of average attribute values within these groups.¹

If l_{+1} = l_{−1} we are effectively choosing w in such a way that we maximize the difference between the mean value of ⟨w · x⟩ within group +1 on the one hand and group −1 on the other, while restricting the variance in these values to be constant.

¹ For example: say there are 5 examples in each class, located in the top and bottom halves of the data matrix; then the r.h.s. vector X̂'y would be represented in MATLAB like this: 5*mean(Xhat(1:5,:)) - 5*mean(Xhat(6:10,:))

42 This is the separation criterion used in so-called discriminant analysis in statistics (cf. p. 19 in the book). The separator obtained in this way is, however, not necessarily a maximum margin separator.

>> Ip = 1:5                           % indices for group +1
>> Im = 6:10                          % indices for group -1
>> y(Ip) = 1
>> y(Im) = -1
>> wb = Xhat\y
>> Xhat*wb
>> % clearly not a max-min margin separator! Why?

43 1-norm regression

When solving an overdetermined system X̂ŵ ≈ y so that ||X̂ŵ − y|| is minimized, we do not necessarily want to define the length of a vector as ||z||_2 = (Σ_i z_i²)^{1/2}, called the two-norm. An alternative is ||z||_1 = Σ_i |z_i|, called the one-norm. With such a definition we wish to determine ŵ in such a way that

    Σ_{i=1}^l | ⟨w · x_i⟩ + b − y_i |

is minimized.

In order to obtain the 1-norm solution we can formulate the problem as the following linear programming problem:

    min  Σ_{i=1}^l ( ξ_i^+ + ξ_i^− )
    subject to  X̂ŵ + ξ^+ − ξ^− = y,   ξ^+ ≥ 0,   ξ^− ≥ 0.

44 Note that

    ξ_i^+ = y_i − ⟨ŵ · x̂_i⟩   if this error is > 0, and 0 otherwise,
    ξ_i^− = ⟨ŵ · x̂_i⟩ − y_i   if the error y_i − ⟨ŵ · x̂_i⟩ is < 0, and 0 otherwise.

This problem is readily solved by the MATLAB function

    X = linprog(f,A,b,Aeq,beq,low)

which solves the problem

    min f'x   s.t.   Ax ≤ b,   Aeq·x = beq,   x ≥ low

(note that the number of elements in f, x and low and the number of columns in A and Aeq must all be the same). We introduce the following (n + 1 + 2l)-vectors:

    x   = [ ŵ'  ξ^+'  ξ^−' ]'
    f   = [ zeros(1, n+1)  ones(1, 2l) ]
    low = [ -inf*ones(1, n+1)  zeros(1, 2l) ]

and set

    A = zeros(1, n+1+2l),   b = 0,   Aeq = [ X̂  I_l  −I_l ],   beq = y.

45
>> y=[6 5 5 7 5 4 4 2 3 4]';          % the original 10 output values again
>> f=[zeros(1,n+1) ones(1,2*ell)]
>> low=[-inf -inf -inf zeros(1,2*ell)]
>> Aeq=[Xhat eye(ell) -eye(ell)]

46
>> sol=linprog(f,zeros(1,n+1+2*ell),0,Aeq,y,low')
Optimization terminated successfully.

% The first two elements of the solution are the w-values
% The third element is the b-value
% The next 10 values are the positive differences between output values and
% calculated values
% The last 10 values are the negative differences between output values and
% calculated values

47
>> error = y-Xhat*sol(1:3)
% note how the error relates to the solution.

48 Other possibilities

Another possibility would be to determine ŵ in such a way that

    max_{i ∈ {1,...,l}} | ⟨w · x_i⟩ + b − y_i |

is minimized; this will be referred to as regression in the ∞-norm. In support-vector regression we encounter yet another possibility for determining ŵ.
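The ∞-norm problem can also be cast as a linear program in the same spirit as the 1-norm formulation above: minimize a single bound t on all the absolute errors. A sketch continuing the example (the variable ordering x = [w; b; t] and the use of inequality constraints are choices made here for illustration):

    % minimize t subject to  -t <= <w . x_i> + b - y_i <= t  for all i
    f   = [zeros(n+1,1); 1];                          % objective: minimize t
    A   = [ Xhat -ones(ell,1); -Xhat -ones(ell,1) ];  % Xhat*wb - y <= t  and  y - Xhat*wb <= t
    bb  = [ y; -y ];
    low = [ -inf*ones(n+1,1); 0 ];                    % wb unconstrained, t >= 0
    sol = linprog(f, A, bb, [], [], low);
    wb_inf = sol(1:n+1);                              % the infinity-norm regression coefficients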
