2018 EE448, Big Data Mining, Lecture 5. (Part II) Weinan Zhang Shanghai Jiao Tong University


1 2018 EE448, Big Data Mining, Lecture 5 Supervised Learning (Part II) Weinan Zhang Shanghai Jiao Tong University

2 Content of Supervised Learning Introduction to Machine Learning Linear Models Support Vector Machines Neural Networks Tree Models Ensemble Methods

3 Content of This Lecture Support Vector Machines Neural Networks

4 Linear Classification For linearly separable cases, we have multiple decision boundaries

5 Linear Classification For linearly separable cases, we have multiple decision boundaries Ruling out some separators by considering data noise

6 Linear Classification For linearly separable cases, we have multiple decision boundaries The intuitive optimal decision boundary: the largest margin

7 Review: Logistic Regression Logistic regression is a binary classification model
p_\theta(y=1|x) = \sigma(\theta^\top x) = \frac{e^{\theta^\top x}}{1+e^{\theta^\top x}}, \quad p_\theta(y=0|x) = \frac{1}{1+e^{\theta^\top x}}
Cross entropy loss function: L(y, x; p_\theta) = -y \log \sigma(\theta^\top x) - (1-y) \log(1 - \sigma(\theta^\top x))
Gradient (using \sigma'(z) = \sigma(z)(1-\sigma(z))): \frac{\partial L}{\partial \theta} = (\sigma(\theta^\top x) - y)\, x
Update rule: \theta \leftarrow \theta + \eta\, (y - \sigma(\theta^\top x))\, x
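A minimal sketch of this update rule (assuming NumPy and a toy 2-D dataset with labels in {0, 1}; the helper names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=100):
    """Stochastic gradient update from the slide:
    theta <- theta + lr * (y - sigmoid(theta^T x)) * x."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            theta += lr * (y_i - sigmoid(theta @ x_i)) * x_i
    return theta

# Toy usage: a constant 1 is appended to each point to model the intercept.
X = np.array([[1.0, 2.0, 1.0], [2.0, 0.5, 1.0]])
y = np.array([1, 0])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta))  # predicted p(y=1|x) for each row
```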

8 Label Decision Logistic regression provides the probability
p_\theta(y=1|x) = \sigma(\theta^\top x) = \frac{e^{\theta^\top x}}{1+e^{\theta^\top x}}, \quad p_\theta(y=0|x) = \frac{1}{1+e^{\theta^\top x}}
The final label of an instance is decided by setting a threshold h:
\hat{y} = 1 if p_\theta(y=1|x) > h, and 0 otherwise

9 Logistic Regression Scores s(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2, \quad p_\theta(y=1|x) = \frac{1}{1+e^{-s(x)}}
Scores of individual points: s(x^{(A)}) = \theta_0 + \theta_1 x_1^{(A)} + \theta_2 x_2^{(A)}, and similarly for x^{(B)} and x^{(C)}
Decision boundary: s(x) = 0
The higher the score, the larger the distance to the decision boundary, and the higher the confidence. Example from Andrew Ng

10 Linear Classification The intuitive optimal decision boundary: the highest confidence

11 Notations for SVMs Feature vector x, class label y \in \{-1, 1\}
Parameters: intercept b, feature weight vector w
Label prediction: h_{w,b}(x) = g(w^\top x + b), \quad g(z) = +1 if z \ge 0, and -1 otherwise

12 Logistic Regression Scores s(x) = b + w_1 x_1 + w_2 x_2, \quad p_\theta(y=1|x) = \frac{1}{1+e^{-s(x)}}
Scores of individual points: s(x^{(A)}) = b + w_1 x_1^{(A)} + w_2 x_2^{(A)}, and similarly for x^{(B)} and x^{(C)}
Decision boundary: s(x) = 0
The higher the score, the larger the distance to the separating hyperplane, and the higher the confidence. Example from Andrew Ng

13 Margins Functional margin: \hat{\gamma}^{(i)} = y^{(i)} (w^\top x^{(i)} + b)
Note that the separating hyperplane won't change with the magnitude of (w, b): g(w^\top x + b) = g(2w^\top x + 2b)
Geometric margin: \gamma^{(i)} = y^{(i)} (w^\top x^{(i)} + b) where \|w\|_2 = 1

14 Margins The projection of x^{(i)} onto the decision boundary satisfies w^\top \left( x^{(i)} - \gamma^{(i)} y^{(i)} \frac{w}{\|w\|} \right) + b = 0
\Rightarrow \gamma^{(i)} = y^{(i)} \frac{w^\top x^{(i)} + b}{\|w\|} = y^{(i)} \left( \left(\frac{w}{\|w\|}\right)^\top x^{(i)} + \frac{b}{\|w\|} \right)
Given a training set S = \{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}, the smallest geometric margin is \gamma = \min_{i=1,\dots,m} \gamma^{(i)}

15 Objective of an SVM Find a separating hyperplane that maximizes the minimum geometric margin
\max_{\gamma, w, b} \gamma \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge \gamma, \ i = 1,\dots,m; \quad \|w\| = 1 (non-convex constraint)
Equivalent to the normalized functional margin
\max_{\hat{\gamma}, w, b} \frac{\hat{\gamma}}{\|w\|} \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge \hat{\gamma}, \ i = 1,\dots,m (non-convex objective)

16 Objective of an SVM The functional margin scales with (w, b) without changing the decision boundary, so let's fix the functional margin at \hat{\gamma} = 1. The objective is written as
\max_{w,b} \frac{1}{\|w\|} \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge 1, \ i = 1,\dots,m
which is equivalent to
\min_{w,b} \frac{1}{2}\|w\|^2 \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge 1, \ i = 1,\dots,m
This optimization problem can be efficiently solved by quadratic programming.
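This QP can be handed to an off-the-shelf solver. A minimal sketch (assuming the cvxopt package and a toy separable dataset; hard_margin_svm and the tiny ridge on b are illustrative choices, not from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve min_{w,b} 0.5*||w||^2  s.t.  y_i (w^T x_i + b) >= 1
    as a QP over the stacked variable z = [w; b]."""
    n, d = X.shape
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)       # 0.5 * z^T P z = 0.5 * ||w||^2
    P[d, d] = 1e-8              # tiny ridge on b keeps the QP solver numerically happy
    q = np.zeros(d + 1)
    # y_i (w^T x_i + b) >= 1  <=>  -y_i [x_i, 1] z <= -1
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    h = -np.ones(n)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]          # w, b

# Toy linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b, y * (X @ w + b))    # all constraint values should be >= 1
```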

17 A Digression of Lagrange Duality in Convex Optimization Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

18 Lagrangian for Convex Optimization A convex optimization problem
\min_w f(w) \quad s.t. \quad h_i(w) = 0, \ i = 1,\dots,l
The Lagrangian of this problem is defined as L(w, \beta) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w), where the \beta_i are the Lagrangian multipliers.
Solving \frac{\partial L}{\partial w} = 0 and \frac{\partial L}{\partial \beta_i} = 0 yields the solution of the original optimization problem.

19 Lagrangian for Convex Optimization (Figure: contour lines f(w) = 3, 4, 5 and the constraint curve h(w) = 0.)
L(w, \beta) = f(w) + \beta h(w), \quad \frac{\partial L}{\partial w} = 0, i.e., the two gradients \nabla f(w) and \nabla h(w) point in the same direction at the solution.

20 With Inequality Constraints A convex optimization problem
\min_w f(w) \quad s.t. \quad g_i(w) \le 0, \ i = 1,\dots,k; \quad h_i(w) = 0, \ i = 1,\dots,l
The Lagrangian of this problem is defined as
L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)
where the \alpha_i and \beta_i are the Lagrangian multipliers.

21 Primal Problem A convex optimization problem
\min_w f(w) \quad s.t. \quad g_i(w) \le 0, \ i = 1,\dots,k; \quad h_i(w) = 0, \ i = 1,\dots,l
with Lagrangian L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)
The primal problem: \theta_P(w) = \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)
If a given w violates any constraint, i.e., g_i(w) > 0 or h_i(w) \ne 0, then \theta_P(w) = +\infty.

22 Primal Problem For the same convex optimization problem, Lagrangian and primal objective \theta_P(w) = \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta):
Conversely, if all constraints are satisfied for a given w, then \theta_P(w) = f(w). Hence
\theta_P(w) = f(w) if w satisfies the primal constraints, and +\infty otherwise.

23 Primal Problem \theta_P(w) = f(w) if w satisfies the primal constraints, and +\infty otherwise
The minimization problem \min_w \theta_P(w) = \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)
is the same as the original problem \min_w f(w) \ s.t. \ g_i(w) \le 0, \ i=1,\dots,k; \ h_i(w) = 0, \ i=1,\dots,l
Define the value of the primal problem p^* = \min_w \theta_P(w)

24 Dual Problem A slightly different problem: define the dual objective \theta_D(\alpha, \beta) = \min_w L(w, \alpha, \beta)
Define the dual optimization problem and its value d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)
Min and max are exchanged compared to the primal problem \min_w \theta_P(w) = \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)

25 Primal Problem vs. Dual Problem d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \le \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*
Proof: \min_w L(w, \alpha, \beta) \le L(w, \alpha, \beta) for all w and all \alpha \ge 0, \beta
\Rightarrow \max_{\alpha, \beta:\, \alpha \ge 0} \min_w L(w, \alpha, \beta) \le \max_{\alpha, \beta:\, \alpha \ge 0} L(w, \alpha, \beta) for all w
\Rightarrow \max_{\alpha, \beta:\, \alpha \ge 0} \min_w L(w, \alpha, \beta) \le \min_w \max_{\alpha, \beta:\, \alpha \ge 0} L(w, \alpha, \beta)
But under certain conditions, d^* = p^*.

26 Karush-Kuhn-Tucker (KKT) Conditions If f and the g_i's are convex, the h_i's are affine, and the g_i's are all strictly feasible, then there exist w*, α*, β* such that w* is the solution of the primal problem, α*, β* are the solutions of the dual problem, and the values of the two problems are equal. Moreover, w*, α*, β* satisfy the KKT conditions, including the KKT dual complementarity condition \alpha_i^* g_i(w^*) = 0. Conversely, if some w*, α*, β* satisfy the KKT conditions, then they are also solutions to the primal and dual problems. For more details please refer to Boyd & Vandenberghe, Convex Optimization, 2004.

27 Now Back to SVM Problem

28 Objective of an SVM SVM objective: finding the optimal margin classifier
\min_{w,b} \frac{1}{2}\|w\|^2 \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge 1, \ i = 1,\dots,m
Re-write the constraints as g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0
so as to match the standard optimization form \min_w f(w) \ s.t. \ g_i(w) \le 0, \ i=1,\dots,k; \ h_i(w) = 0, \ i=1,\dots,l

29 Equality Cases g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 = 0
\Rightarrow y^{(i)} \left( \left(\frac{w}{\|w\|}\right)^\top x^{(i)} + \frac{b}{\|w\|} \right) = \frac{1}{\|w\|} (the geometric margin)
The g_i = 0 cases correspond to the training examples whose functional margin is exactly equal to 1.

30 Objective of an SVM SVM objective: finding the optimal margin classifier
\min_{w,b} \frac{1}{2}\|w\|^2 \quad s.t. \quad -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0, \ i = 1,\dots,m
Lagrangian: L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^\top x^{(i)} + b) - 1 \right]
There are no \beta's or equality constraints in the SVM problem.

31 Solving L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^\top x^{(i)} + b) - 1 \right]
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \Rightarrow w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}
\frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0
Then the Lagrangian is re-written as
L(w, b, \alpha) = \frac{1}{2} \left\| \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^\top x^{(i)} + b) - 1 \right] = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)} - b \sum_{i=1}^{m} \alpha_i y^{(i)}
where the last term equals 0.

32 Solving \alpha^* Dual problem: \max_{\alpha \ge 0} \theta_D(\alpha) = \max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha)
\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)}
s.t. \alpha_i \ge 0, \ i = 1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
\alpha^* is solved with some method, e.g. SMO. We will get back to this solution later.

33 Solving w^* and b^* With \alpha^* solved, w^* is obtained by w^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}
Only the support vectors have \alpha_i > 0.
With w^* solved, b^* is obtained by b^* = -\frac{\max_{i: y^{(i)} = -1} w^{*\top} x^{(i)} + \min_{i: y^{(i)} = 1} w^{*\top} x^{(i)}}{2}

34 Predicting Values With the solutions of w^* and b^*, the predicted value (i.e. functional margin) of each instance is
w^{*\top} x + b^* = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^\top x + b^* = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b^*
We only need to calculate the inner product of x with the support vectors.
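A minimal sketch (NumPy; alpha is assumed to come from a dual solver such as the SMO procedure introduced later, and the helper names are illustrative) of recovering w*, b* and predicting through inner products with the support vectors only:

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    """w* = sum_i alpha_i y_i x_i and
    b* = -(max_{y=-1} w^T x + min_{y=+1} w^T x) / 2, as on the slides."""
    w = (alpha * y) @ X
    b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2
    support = alpha > tol          # only support vectors have alpha_i > 0
    return w, b, support

def predict_margin(x, alpha, X, y, b, support):
    """Functional margin computed via inner products with support vectors only."""
    return sum(alpha[i] * y[i] * (X[i] @ x) for i in np.where(support)[0]) + b
```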

35 Non-Separable Cases The derivation of the SVM as presented so far assumes that the data is linearly separable. More practical cases are linearly non-separable.

36 Dealing with Non-Separable Cases Add slack variables \xi_i:
\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad (L1 regularization of the slacks)
s.t. y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i, \ i=1,\dots,m; \quad \xi_i \ge 0, \ i=1,\dots,m
Lagrangian: L(w, b, \xi, \alpha, r) = \frac{1}{2} w^\top w + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(x^{(i)\top} w + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i
Dual problem: \max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)}
s.t. 0 \le \alpha_i \le C, \ i=1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
Surprisingly, the box constraint 0 \le \alpha_i \le C is the only change. The dual is efficiently solved by the SMO algorithm.
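In practice the soft-margin dual is rarely solved by hand. A library sketch (assuming scikit-learn, with a deliberately non-separable toy set) that exposes the same C parameter:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that is NOT linearly separable (one label is flipped on purpose).
X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5], [1.8, 1.8]])
y = np.array([1, 1, -1, -1, -1])

# C trades margin width against slack: a small C tolerates more violations.
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # w and b of the soft-margin separator
print(clf.support_)                # indices of the support vectors
print(clf.predict([[1.0, 1.0]]))
```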

37 SVM Hinge Loss vs. LR Loss SVM hinge loss: \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \max(0, 1 - y_i(w^\top x_i + b))
LR log loss: -y_i \log \sigma(w^\top x_i + b) - (1 - y_i) \log(1 - \sigma(w^\top x_i + b))
(Figure: the two losses plotted against the score for y = 1.)
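A small sketch (NumPy; toy inputs assumed) that makes the two losses concrete. Note the label conventions differ: {-1, +1} for the hinge loss and {0, 1} for the log loss.

```python
import numpy as np

def hinge_objective(w, b, X, y, C=1.0):
    """SVM objective: 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)),
    with labels y_i in {-1, +1}."""
    margins = y * (X @ w + b)
    return 0.5 * w @ w + C * np.sum(np.maximum(0.0, 1.0 - margins))

def log_loss(w, b, X, y01):
    """LR log loss: -sum_i [y_i log s_i + (1 - y_i) log(1 - s_i)],
    with labels y_i in {0, 1} and s_i = sigmoid(w^T x_i + b)."""
    s = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return -np.sum(y01 * np.log(s) + (1 - y01) * np.log(1 - s))
```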

38 Coordinate Ascent (Descent) For the optimization problem \max_\alpha W(\alpha_1, \alpha_2, \dots, \alpha_m)
Coordinate ascent algorithm:
Loop until convergence: for i = 1,\dots,m: \alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \dots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \dots, \alpha_m)
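A minimal sketch of this loop (NumPy; the exact inner arg-max is replaced here by a 1-D grid search, which is an illustrative simplification, not the slide's method):

```python
import numpy as np

def coordinate_ascent(W, alpha0, grid, n_iters=50):
    """Cycle over coordinates; for each one, pick the grid value that
    maximizes W with all other coordinates held fixed."""
    alpha = alpha0.copy()
    for _ in range(n_iters):
        for i in range(len(alpha)):
            alpha[i] = max(grid, key=lambda v: W(
                np.concatenate([alpha[:i], [v], alpha[i + 1:]])))
    return alpha

# Usage on a concave toy objective W(a) = -(a1 - 1)^2 - (a2 - 2)^2.
W = lambda a: -(a[0] - 1.0) ** 2 - (a[1] - 2.0) ** 2
print(coordinate_ascent(W, np.zeros(2), grid=np.linspace(0, 3, 301)))
```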

39 Coordinate Ascent (Descent) A two-dimensional coordinate ascent example

40 SMO Algorithm SMO: sequential minimal optimization. SVM optimization problem:
\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)}
s.t. 0 \le \alpha_i \le C, \ i=1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
We cannot directly apply the coordinate ascent algorithm, because \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \Rightarrow \alpha_i y^{(i)} = -\sum_{j \ne i} \alpha_j y^{(j)}, so no single \alpha_i can change while all the others are held fixed.

41 SMO Algorithm Update two variables each time:
Loop until convergence { 1. Select some pair \alpha_i and \alpha_j to update next. 2. Re-optimize W(\alpha) w.r.t. \alpha_i and \alpha_j. }
Convergence test: whether the change of W(\alpha) is smaller than a predefined value (e.g. 0.01)
The key advantage of the SMO algorithm is that the update of \alpha_i and \alpha_j (step 2) is efficient.

42 SMO Algorithm \max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)}
s.t. 0 \le \alpha_i \le C, \ i=1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
Without loss of generality, hold \alpha_3, \dots, \alpha_m fixed and optimize W(\alpha) w.r.t. \alpha_1 and \alpha_2:
\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta \Rightarrow \alpha_1 = (\zeta - \alpha_2 y^{(2)})\, y^{(1)}

43 SMO Algorithm With \alpha_1 = (\zeta - \alpha_2 y^{(2)})\, y^{(1)}, the objective is written as
W(\alpha_1, \alpha_2, \dots, \alpha_m) = W((\zeta - \alpha_2 y^{(2)})\, y^{(1)}, \alpha_2, \dots, \alpha_m)
Thus the original optimization problem
\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)} \quad s.t. \quad 0 \le \alpha_i \le C, \ i=1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
is transformed into a quadratic optimization problem w.r.t. \alpha_2:
\max_{\alpha_2} W(\alpha_2) = a \alpha_2^2 + b \alpha_2 + c \quad s.t. \quad 0 \le \alpha_2 \le C

44 SMO Algorithm Optimizing a one-dimensional quadratic function is very efficient:
\max_{\alpha_2} W(\alpha_2) = a \alpha_2^2 + b \alpha_2 + c \quad s.t. \quad 0 \le \alpha_2 \le C
(Figure: three cases, the unconstrained maximizer -b/(2a) lies inside [0, C], below 0, or above C; in the latter two cases the solution is clipped to the boundary.)
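This is the step that makes SMO cheap: each inner update is a clipped one-dimensional quadratic maximization. A small sketch (the coefficients a, b, c are assumed to have been computed from the fixed alphas):

```python
def maximize_clipped_quadratic(a, b, c, C):
    """Maximize W(a2) = a*a2^2 + b*a2 + c over the box 0 <= a2 <= C.
    If the parabola opens downward (a < 0) the unconstrained maximizer is
    -b / (2a), clipped into [0, C]; otherwise the maximum is at an endpoint."""
    candidates = [0.0, C]
    if a < 0:
        stationary = -b / (2.0 * a)
        candidates.append(min(max(stationary, 0.0), C))  # clip into [0, C]
    return max(candidates, key=lambda a2: a * a2 ** 2 + b * a2 + c)

# Example: W(a2) = -a2^2 + 3*a2 with C = 1 -> the clipped maximizer is a2 = 1.
print(maximize_clipped_quadratic(-1.0, 3.0, 0.0, C=1.0))
```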

45 Content of This Lecture Support Vector Machines Neural Networks

46 Breaking News of AI in 2016 AlphaGo wins Lee Sedol (4-1)

47 Machine Learning in AlphaGo Policy Network Supervised Learning Predict what is the best next human move Reinforcement Learning Learning to select the next move to maximize the winning rate Value Network Expectation of winning given the board state Implemented by (deep) neural networks

48 Neural Networks Neural networks are the basis of deep learning Perceptron Multi-layer Perceptron Convolutional Neural Network Recurrent Neural Network

49 Real Neurons Cell structures Cell body Dendrites Axon Synaptic terminals Slides credit: Ray Mooney

50 Neural Communication Electrical potential across cell membrane exhibits spikes called action potentials. Spike originates in cell body, travels down axon, and causes synaptic terminals to release neurotransmitters. Chemical diffuses across synapse to dendrites of other neurons. Neurotransmitters can be excitatory or inhibitory. If net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, it fires an action potential. Slides credit: Ray Mooney

51 Real Neural Learning Synapses change size and strength with experience. Hebbian learning: When two connected neurons are firing at the same time, the strength of the synapse between them increases. Neurons that fire together, wire together. These motivate the research of artificial neural nets Slides credit: Ray Mooney

52 Brief History of Artificial Neural Nets The first wave: 1943 McCulloch and Pitts proposed the McCulloch-Pitts neuron model; 1958 Rosenblatt introduced the simple single-layer networks now called perceptrons; 1969 Minsky and Papert's book Perceptrons demonstrated the limitation of single-layer perceptrons, and almost the whole field went into hibernation. The second wave: 1986 the back-propagation learning algorithm for multi-layer perceptrons was rediscovered and the whole field took off again. The third wave: 2006 deep (neural network) learning gained popularity, and in 2012 it made significant breakthroughs in many applications. Slides credit: Jun Wang

53 Artificial Neuron Model Model the network as a graph with cells as nodes and synaptic connections as weighted edges w_{ji} from node i to node j.
Model the net input to cell j as net_j = \sum_i w_{ji} o_i
Cell output: o_j = 0 if net_j < T_j, and 1 if net_j \ge T_j (T_j is the threshold for unit j)
McCulloch and Pitts [1943] Slides credit: Ray Mooney

54 Perceptron Model Rosenblatt's single-layer perceptron [1958]. Rosenblatt [1958] further proposed the perceptron as the first model for learning with a teacher (i.e., supervised learning).
Prediction: \hat{y} = \varphi\left( \sum_{i=1}^{m} w_i x_i + b \right)
Activation function: \varphi(z) = 1 if z \ge 0, and -1 otherwise (y = 1: class one, y = -1: class two)
The focus was on how to find appropriate weights w_i for the two-class classification task.

55 Training Perceptron Rosenblatt's single-layer perceptron [1958].
Training: w_i = w_i + \eta (y - \hat{y}) x_i, \quad b = b + \eta (y - \hat{y})
Prediction: \hat{y} = \varphi\left( \sum_{i=1}^{m} w_i x_i + b \right), \quad \varphi(z) = 1 if z \ge 0, and -1 otherwise
Equivalent to the rules: if the output is correct, do nothing; if the output is high, lower the weights on positive inputs; if the output is low, increase the weights on active inputs.
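A minimal sketch of this training rule (NumPy, toy separable data, labels in {-1, +1}; the helper name is illustrative):

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Rosenblatt's rule: w_i <- w_i + lr*(y - y_hat)*x_i, b <- b + lr*(y - y_hat)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if (w @ x_i + b) >= 0 else -1.0
            w += lr * (y_i - y_hat) * x_i
            b += lr * (y_i - y_hat)
    return w, b

# Linearly separable toy problem.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))   # should reproduce y
```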

56 Properties of Perceptron Rosenblatt's single-layer perceptron [1958]. Rosenblatt proved the convergence of the learning algorithm if the two classes are linearly separable (i.e., the patterns lie on opposite sides of a hyperplane w_1 x_1 + w_2 x_2 + b = 0).
Many people hoped that such a machine could be the basis for artificial intelligence.

57 Properties of Perceptron The XOR problem:
Input (X_1, X_2) -> Output X_1 XOR X_2: (0, 0) -> 0; (0, 1) -> 1; (1, 0) -> 1; (1, 1) -> 0
XOR is non-linearly separable: these two classes (true and false) cannot be separated using a line.
Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, could not be done by Rosenblatt's one-layer perceptron.
Rosenblatt believed the limitations could be overcome if more layers of units were added, but no learning algorithm to obtain the weights was known yet.
Due to the lack of learning algorithms, people left the neural network paradigm for almost 20 years.

58 Hidden Layers and Backpropagation (1986~) Adding hidden layer(s) (an internal representation) allows the network to learn a mapping that is not constrained to be linearly separable.
(Figure: a single unit with inputs x_1, x_2, weights w_1, w_2 and bias b realizes the decision boundary x_1 w_1 + x_2 w_2 + b = 0; each hidden node realizes one of the lines bounding the convex region separating class 1 from class 2.)

59 Hidden Layers and Backpropagation (1986~) But the solution is quite often not unique.
(Figure: the XOR truth table with two different two-layer solutions, solution 1 and solution 2; two lines are necessary to divide the sample space accordingly. Sign activation function; the number in each circle is a threshold.)

60 Hidden Layers and Backpropagation (1986~) Feedforward: messages move forward from the input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles or loops in the network. (Figure: a two-layer feedforward neural network with two sets of weight parameters.)

61 Single / Multiple Layers of Calculation Single-layer functions: f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2, \quad f_\theta(x) = \sigma(\theta_0 + \theta_1 x + \theta_2 x^2)
Multiple-layer function: h_1(x) = \tanh(\theta_0 + \theta_1 x + \theta_2 x^2), \quad h_2(x) = \tanh(\theta_3 + \theta_4 x + \theta_5 x^2)
f_\theta(x) = f_\theta(h_1(x), h_2(x)) = \sigma(\theta_6 + \theta_7 h_1 + \theta_8 h_2)
With non-linear activation functions \sigma(x) = \frac{1}{1+e^{-x}} and \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}
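A direct transcription of the multi-layer function on this slide (NumPy; the parameter values below are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_f(x, theta):
    """f_theta(x) = sigmoid(theta6 + theta7*h1 + theta8*h2) with
    h1 = tanh(theta0 + theta1*x + theta2*x^2),
    h2 = tanh(theta3 + theta4*x + theta5*x^2)."""
    h1 = np.tanh(theta[0] + theta[1] * x + theta[2] * x ** 2)
    h2 = np.tanh(theta[3] + theta[4] * x + theta[5] * x ** 2)
    return sigmoid(theta[6] + theta[7] * h1 + theta[8] * h2)

# Usage with illustrative parameter values.
theta = np.array([0.1, -0.5, 0.3, 0.2, 0.8, -0.1, 0.0, 1.5, -2.0])
print(two_layer_f(1.0, theta))
```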

62 Non-linear Activation Functions Sigmoid: \sigma(z) = \frac{1}{1+e^{-z}} \quad Tanh: \tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}} \quad Rectified Linear Unit (ReLU): ReLU(z) = \max(0, z)

63 Universal Approximation Theorem A feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron) can approximate continuous functions on compact subsets of \mathbb{R}^n, under mild assumptions on the activation function (such as sigmoid, tanh and ReLU). [Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989).]

64 Universal Approximation A multi-layer perceptron can approximate any continuous function on a compact subset of \mathbb{R}^n. \sigma(x) = \frac{1}{1+e^{-x}}, \quad \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}

65 Hidden Layers and Backpropagation (1986~) One of the efficient algorithms for multi-layer neural networks is the backpropagation algorithm. It was re-introduced in 1986 and neural networks regained their popularity. (Figure: the error is calculated at the output and back-propagated through the weight parameters of each layer.) Note: backpropagation appears to have been first found by Werbos [1974], and then independently rediscovered around 1985 by Rumelhart, Hinton, and Williams [1986] and by Parker [1985].

66 Learning NN by Back-Propagation Compare the outputs y_k with the correct answer to get the error derivatives at the output units. [LeCun, Bengio and Hinton. Deep Learning. Nature 2015.]

67 Learning NN by Back-Propagation Error back-propagation. (Figure: a network with inputs x_1, x_2, ..., x_m and outputs y_1, y_0; the weight parameters of each layer are adjusted from the error calculated at the output. Training instances: an image labeled "Face" with target d_1 = 1 and an image labeled "no face" with target d_2 = 0.)

68 Make a Prediction A two-layer feedforward neural network: inputs x = (x_1, \dots, x_m), a hidden layer with units h_j^{(1)}, an output layer with outputs y_k, and labels d_k.
Feed-forward prediction:
h_j^{(1)} = f^{(1)}(net_j^{(1)}) = f^{(1)}\left( \sum_m w_{j,m}^{(1)} x_m \right), \quad y_k = f^{(2)}(net_k^{(2)}) = f^{(2)}\left( \sum_j w_{k,j}^{(2)} h_j^{(1)} \right)
where net_j^{(1)} = \sum_m w_{j,m}^{(1)} x_m and net_k^{(2)} = \sum_j w_{k,j}^{(2)} h_j^{(1)}


71 When Backprop / Learn Parameters For the two-layer feedforward network (inputs x_m, hidden units h_j^{(1)}, outputs y_k, labels d_k), define the squared error E(W) = \frac{1}{2} \sum_k (y_k - d_k)^2.
Backprop to learn the output-layer parameters w_{k,j}^{(2)}:
\frac{\partial E}{\partial w_{k,j}^{(2)}} = (y_k - d_k)\, f^{(2)\prime}(net_k^{(2)})\, h_j^{(1)} = -\delta_k h_j^{(1)}, \quad where \ \delta_k = (d_k - y_k)\, f^{(2)\prime}(net_k^{(2)})
\Delta w_{k,j}^{(2)} = \eta \cdot Error_k \cdot Output_j = \eta\, \delta_k h_j^{(1)}, \quad w_{k,j}^{(2)} = w_{k,j}^{(2)} + \Delta w_{k,j}^{(2)}
Notations: net_j^{(1)} = \sum_m w_{j,m}^{(1)} x_m, \quad net_k^{(2)} = \sum_j w_{k,j}^{(2)} h_j^{(1)}

72 When Backprop / Learn Parameters Backprop to learn the hidden-layer parameters w_{j,m}^{(1)}:
\frac{\partial E}{\partial w_{j,m}^{(1)}} = \sum_k (y_k - d_k)\, f^{(2)\prime}(net_k^{(2)})\, w_{k,j}^{(2)}\, f^{(1)\prime}(net_j^{(1)})\, x_m = -\delta_j x_m, \quad where \ \delta_j = f^{(1)\prime}(net_j^{(1)}) \sum_k \delta_k w_{k,j}^{(2)}
\Delta w_{j,m}^{(1)} = \eta \cdot Error_j \cdot Output_m = \eta\, \delta_j x_m, \quad w_{j,m}^{(1)} = w_{j,m}^{(1)} + \Delta w_{j,m}^{(1)}
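A compact sketch that ties the two update rules together (NumPy; sigmoid activations, squared error, biases omitted for brevity; the network shape and values are toy assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, lr=0.5):
    """One forward/backward pass for a two-layer sigmoid network under
    E = 0.5 * sum_k (y_k - d_k)^2, following the slide's rules:
    delta_k = (d_k - y_k) f'(net_k),  delta_j = f'(net_j) * sum_k delta_k W2[k, j]."""
    # Forward pass
    h = sigmoid(W1 @ x)                 # hidden activations h_j
    y = sigmoid(W2 @ h)                 # outputs y_k
    # Backward pass (sigmoid derivative is f * (1 - f))
    delta_k = (d - y) * y * (1 - y)
    delta_j = h * (1 - h) * (W2.T @ delta_k)
    # Gradient updates: Delta w = lr * delta * (input of that layer)
    W2 += lr * np.outer(delta_k, h)
    W1 += lr * np.outer(delta_j, x)
    return y, W1, W2

# Toy usage: 3 inputs -> 2 hidden units -> 1 output, target d = 0.9.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(1, 2))
x, d = np.array([1.0, 0.5, -0.2]), np.array([0.9])
for _ in range(100):
    y, W1, W2 = backprop_step(x, d, W1, W2)
print(y)   # should have moved towards 0.9
```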

73 An example for Backprop

74 An Example for Backprop Output layer: \delta_k = (d_k - y_k)\, f^{(2)\prime}(net_k^{(2)}), \quad \Delta w_{k,j}^{(2)} = \eta \cdot Error_k \cdot Output_j = \eta\, \delta_k h_j^{(1)}
Hidden layer: \delta_j = f^{(1)\prime}(net_j^{(1)}) \sum_k \delta_k w_{k,j}^{(2)}, \quad \Delta w_{j,m}^{(1)} = \eta \cdot Error_j \cdot Output_m = \eta\, \delta_j x_m
Consider the sigmoid activation function f_{Sigmoid}(x) = \frac{1}{1+e^{-x}}, whose derivative is f'_{Sigmoid}(x) = f_{Sigmoid}(x)(1 - f_{Sigmoid}(x))

75 Let us do some calculation Consider the simple network below, and assume that the neurons have a sigmoid activation function: 1. Perform a forward pass on the network. 2. Perform a reverse pass (training) once (target = 0.5). 3. Perform a further forward pass and comment on the result.

76 Let us do some calculation

77 A demo from Google

78 Non-linear Activation Functions Sigmoid: \sigma(z) = \frac{1}{1+e^{-z}} \quad Tanh: \tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}} \quad Rectified Linear Unit (ReLU): ReLU(z) = \max(0, z)

79 Activation Functions (Figure: plots of f_{linear}(x), f_{Sigmoid}(x), f_{tanh}(x) and their derivatives f'_{linear}(x), f'_{Sigmoid}(x), f'_{tanh}(x).)

80 Activation Functions Logistic sigmoid: f_{Sigmoid}(x) = \frac{1}{1+e^{-x}}, with derivative f'_{Sigmoid}(x) = f_{Sigmoid}(x)(1 - f_{Sigmoid}(x))
Output range [0, 1]. Motivated by biological neurons; the output can be interpreted as the probability of an artificial neuron firing given its inputs.
However, saturated neurons make gradients vanish (why?).

81 Activation Functions Tanh function: f_{tanh}(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}, with gradient f'_{tanh}(x) = 1 - f_{tanh}(x)^2
Output range [-1, 1]. Thus strongly negative inputs to the tanh map to negative outputs, and only zero-valued inputs are mapped to near-zero outputs. These properties make the network less likely to get stuck during training.

82 Activation Functions ReLU (rectified linear unit): f_{ReLU}(x) = \max(0, x), with derivative f'_{ReLU}(x) = 1 if x > 0, and 0 if x \le 0
Another version is the noisy ReLU: f_{NoisyReLU}(x) = \max(0, x + N(0, \delta(x)))
ReLU can be approximated by the softplus function f_{Softplus}(x) = \log(1 + e^x) (google.com/en//pubs/archive/40811.pdf)
The ReLU gradient doesn't vanish as x increases; it can be used to model positive numbers; it is fast, since there is no need to compute the exponential function; and it eliminates the necessity of a pretraining phase.
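A tiny sketch of these three functions (NumPy; inputs are assumed to be arrays or scalars):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return np.where(np.asarray(x) > 0, 1.0, 0.0)   # 1 if x > 0, else 0

def softplus(x):
    return np.log1p(np.exp(x))                      # log(1 + e^x), smooth ReLU

print(relu(-2.0), relu(3.0), relu_grad(3.0), softplus(0.0))
```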

83 Activation Functions ReLU (rectified linear unit): f_{ReLU}(x) = \max(0, x); it can be approximated by the softplus function f_{Softplus}(x) = \log(1 + e^x)
The only non-linearity comes from the path selection, with individual neurons being active or not. It allows sparse representations: for a given input, only a subset of neurons are active (sparse propagation of activations and gradients).
Additional activation functions: Leaky ReLU, Exponential LU, Maxout, etc.

84 Error/Loss Function Recall stochastic gradient descent: update w from a randomly picked example (but in practice do a batch update).
Squared error loss for one binary output: L(w) = \frac{1}{2} (y - f_w(x))^2, where x is the input and f_w(x) the output.

85 Error/Loss Function Softmax (cross-entropy loss) for multiple classes (class labels follow a multinomial distribution):
L(w) = -\sum_k \left( d_k \log \hat{y}_k + (1 - d_k) \log(1 - \hat{y}_k) \right)
where \hat{y}_k = \frac{\exp\left( \sum_j w_{k,j}^{(2)} h_j^{(1)} \right)}{\sum_{k'} \exp\left( \sum_j w_{k',j}^{(2)} h_j^{(1)} \right)} and the d_k are one-hot encoded class labels.
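A minimal sketch (NumPy) of the softmax output and the -\sum_k d_k \log \hat{y}_k term of this loss for a one-hot label; the complementary (1 - d_k) term on the slide is omitted here for simplicity:

```python
import numpy as np

def softmax(net):
    """y_hat_k = exp(net_k) / sum_k' exp(net_k'), with the usual max-shift
    for numerical stability."""
    e = np.exp(net - np.max(net))
    return e / np.sum(e)

def cross_entropy(d, y_hat):
    """-sum_k d_k log y_hat_k for a one-hot encoded label vector d."""
    return -np.sum(d * np.log(y_hat + 1e-12))

# Usage: three classes, the true class is the second one.
net = np.array([1.0, 2.0, -0.5])     # output-layer pre-activations
d = np.array([0.0, 1.0, 0.0])        # one-hot label
print(cross_entropy(d, softmax(net)))
```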

86 For more machine learning details, you can check out my machine learning course at Zhiyuan College CS420 Machine Learning Weinan Zhang Course webpage:
