2018 EE448, Big Data Mining, Lecture 5. (Part II) Weinan Zhang Shanghai Jiao Tong University


1 2018 EE448, Big Data Mining, Lecture 5 Supervised Learning (Part II) Weinan Zhang Shanghai Jiao Tong University

2 Content of Supervised Learning Introduction to Machine Learning Linear Models Support Vector Machines Neural Networks Tree Models Ensemble Methods

3 Content of This Lecture Support Vector Machines Neural Networks

4 Linear Classification For linearly separable cases, we have multiple decision boundaries

5 Linear Classification For linearly separable cases, we have multiple decision boundaries Ruling out some separators by considering data noise

6 Linear Classification For linearly separable cases, we have multiple decision boundaries The intuitive optimal decision boundary: the largest margin

7 Review: Logistic Regression Logistic regression is a binary classification model
p_\theta(y=1|x) = \sigma(\theta^\top x) = \frac{e^{\theta^\top x}}{1+e^{\theta^\top x}}, \quad p_\theta(y=0|x) = \frac{1}{1+e^{\theta^\top x}}
Cross entropy loss function: L(y, x; p_\theta) = -y \log \sigma(\theta^\top x) - (1-y) \log(1 - \sigma(\theta^\top x))
Gradient (using \sigma'(z) = \sigma(z)(1-\sigma(z))): \frac{\partial L}{\partial \theta} = (\sigma(\theta^\top x) - y)\, x
Update rule: \theta \leftarrow \theta + \eta\, (y - \sigma(\theta^\top x))\, x
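A minimal sketch of this update rule (assuming NumPy and a toy 2-D dataset with labels in {0, 1}; the helper names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=100):
    """Stochastic gradient update from the slide:
    theta <- theta + lr * (y - sigmoid(theta^T x)) * x."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            theta += lr * (y_i - sigmoid(theta @ x_i)) * x_i
    return theta

# Toy usage: a constant 1 is appended to each point to model the intercept.
X = np.array([[1.0, 2.0, 1.0], [2.0, 0.5, 1.0]])
y = np.array([1, 0])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta))  # predicted p(y=1|x) for each row
```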

8 Label Decision Logistic regression provides the probability
p_\theta(y=1|x) = \sigma(\theta^\top x) = \frac{e^{\theta^\top x}}{1+e^{\theta^\top x}}, \quad p_\theta(y=0|x) = \frac{1}{1+e^{\theta^\top x}}
The final label of an instance is decided by setting a threshold h:
\hat{y} = 1 if p_\theta(y=1|x) > h, and 0 otherwise

9 Logistic Regression Scores s(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2, \quad p_\theta(y=1|x) = \frac{1}{1+e^{-s(x)}}
Scores of individual points: s(x^{(A)}) = \theta_0 + \theta_1 x_1^{(A)} + \theta_2 x_2^{(A)}, and similarly for x^{(B)} and x^{(C)}
Decision boundary: s(x) = 0
The higher the score, the larger the distance to the decision boundary, and the higher the confidence. Example from Andrew Ng

10 Linear Classification The intuitive optimal decision boundary: the highest confidence

11 Notations for SVMs Feature vector x, class label y \in \{-1, 1\}
Parameters: intercept b, feature weight vector w
Label prediction: h_{w,b}(x) = g(w^\top x + b), \quad g(z) = +1 if z \ge 0, and -1 otherwise

12 Logistic Regression Scores s(x) = b + w_1 x_1 + w_2 x_2, \quad p_\theta(y=1|x) = \frac{1}{1+e^{-s(x)}}
Scores of individual points: s(x^{(A)}) = b + w_1 x_1^{(A)} + w_2 x_2^{(A)}, and similarly for x^{(B)} and x^{(C)}
Decision boundary: s(x) = 0
The higher the score, the larger the distance to the separating hyperplane, and the higher the confidence. Example from Andrew Ng

13 Margins Functional margin: \hat{\gamma}^{(i)} = y^{(i)} (w^\top x^{(i)} + b)
Note that the separating hyperplane won't change with the magnitude of (w, b): g(w^\top x + b) = g(2w^\top x + 2b)
Geometric margin: \gamma^{(i)} = y^{(i)} (w^\top x^{(i)} + b) where \|w\|_2 = 1

14 Margins The projection of x^{(i)} onto the decision boundary satisfies w^\top \left( x^{(i)} - \gamma^{(i)} y^{(i)} \frac{w}{\|w\|} \right) + b = 0
\Rightarrow \gamma^{(i)} = y^{(i)} \frac{w^\top x^{(i)} + b}{\|w\|} = y^{(i)} \left( \left(\frac{w}{\|w\|}\right)^\top x^{(i)} + \frac{b}{\|w\|} \right)
Given a training set S = \{(x^{(i)}, y^{(i)})\}_{i=1,\dots,m}, the smallest geometric margin is \gamma = \min_{i=1,\dots,m} \gamma^{(i)}

15 Objective of an SVM Find a separating hyperplane that maximizes the minimum geometric margin
\max_{\gamma, w, b} \gamma \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge \gamma, \ i = 1,\dots,m; \quad \|w\| = 1 (non-convex constraint)
Equivalent to the normalized functional margin
\max_{\hat{\gamma}, w, b} \frac{\hat{\gamma}}{\|w\|} \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge \hat{\gamma}, \ i = 1,\dots,m (non-convex objective)

16 Objective of an SVM The functional margin scales with (w, b) without changing the decision boundary, so let's fix the functional margin at \hat{\gamma} = 1. The objective is written as
\max_{w,b} \frac{1}{\|w\|} \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge 1, \ i = 1,\dots,m
which is equivalent to
\min_{w,b} \frac{1}{2}\|w\|^2 \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge 1, \ i = 1,\dots,m
This optimization problem can be efficiently solved by quadratic programming.
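This QP can be handed to an off-the-shelf solver. A minimal sketch (assuming the cvxopt package and a toy separable dataset; hard_margin_svm and the tiny ridge on b are illustrative choices, not from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve min_{w,b} 0.5*||w||^2  s.t.  y_i (w^T x_i + b) >= 1
    as a QP over the stacked variable z = [w; b]."""
    n, d = X.shape
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)       # 0.5 * z^T P z = 0.5 * ||w||^2
    P[d, d] = 1e-8              # tiny ridge on b keeps the QP solver numerically happy
    q = np.zeros(d + 1)
    # y_i (w^T x_i + b) >= 1  <=>  -y_i [x_i, 1] z <= -1
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    h = -np.ones(n)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:d], z[d]          # w, b

# Toy linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b, y * (X @ w + b))    # all constraint values should be >= 1
```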

17 A Digression of Lagrange Duality in Convex Optimization Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

18 Lagrangian for Convex Optimization A convex optimization problem
\min_w f(w) \quad s.t. \quad h_i(w) = 0, \ i = 1,\dots,l
The Lagrangian of this problem is defined as L(w, \beta) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w), where the \beta_i are the Lagrangian multipliers.
Solving \frac{\partial L}{\partial w} = 0 and \frac{\partial L}{\partial \beta_i} = 0 yields the solution of the original optimization problem.

19 Lagrangian for Convex Optimization (Figure: contour lines f(w) = 3, 4, 5 and the constraint curve h(w) = 0.)
L(w, \beta) = f(w) + \beta h(w), \quad \frac{\partial L}{\partial w} = 0, i.e., the two gradients \nabla f(w) and \nabla h(w) point in the same direction at the solution.

20 With Inequality Constraints A convex optimization problem
\min_w f(w) \quad s.t. \quad g_i(w) \le 0, \ i = 1,\dots,k; \quad h_i(w) = 0, \ i = 1,\dots,l
The Lagrangian of this problem is defined as
L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)
where the \alpha_i and \beta_i are the Lagrangian multipliers.

21 Primal Problem A convex optimization problem
\min_w f(w) \quad s.t. \quad g_i(w) \le 0, \ i = 1,\dots,k; \quad h_i(w) = 0, \ i = 1,\dots,l
with Lagrangian L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)
The primal problem: \theta_P(w) = \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)
If a given w violates any constraint, i.e., g_i(w) > 0 or h_i(w) \ne 0, then \theta_P(w) = +\infty.

22 Primal Problem For the same convex optimization problem, Lagrangian and primal objective \theta_P(w) = \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta):
Conversely, if all constraints are satisfied for a given w, then \theta_P(w) = f(w). Hence
\theta_P(w) = f(w) if w satisfies the primal constraints, and +\infty otherwise.

23 Primal Problem \theta_P(w) = f(w) if w satisfies the primal constraints, and +\infty otherwise
The minimization problem \min_w \theta_P(w) = \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)
is the same as the original problem \min_w f(w) \ s.t. \ g_i(w) \le 0, \ i=1,\dots,k; \ h_i(w) = 0, \ i=1,\dots,l
Define the value of the primal problem p^* = \min_w \theta_P(w)

24 Dual Problem A slightly different problem: define the dual objective \theta_D(\alpha, \beta) = \min_w L(w, \alpha, \beta)
Define the dual optimization problem and its value d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)
Min and max are exchanged compared to the primal problem \min_w \theta_P(w) = \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)

25 Primal Problem vs. Dual Problem d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \le \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*
Proof: \min_w L(w, \alpha, \beta) \le L(w, \alpha, \beta) for all w and all \alpha \ge 0, \beta
\Rightarrow \max_{\alpha, \beta:\, \alpha \ge 0} \min_w L(w, \alpha, \beta) \le \max_{\alpha, \beta:\, \alpha \ge 0} L(w, \alpha, \beta) for all w
\Rightarrow \max_{\alpha, \beta:\, \alpha \ge 0} \min_w L(w, \alpha, \beta) \le \min_w \max_{\alpha, \beta:\, \alpha \ge 0} L(w, \alpha, \beta)
But under certain conditions, d^* = p^*.

26 Karush-Kuhn-Tucker (KKT) Conditions If f and the g_i's are convex, the h_i's are affine, and the g_i's are all strictly feasible, then there exist w*, α*, β* such that w* is the solution of the primal problem, α*, β* are the solutions of the dual problem, and the values of the two problems are equal. Moreover, w*, α*, β* satisfy the KKT conditions, including the KKT dual complementarity condition \alpha_i^* g_i(w^*) = 0. Conversely, if some w*, α*, β* satisfy the KKT conditions, then they are also solutions to the primal and dual problems. For more details please refer to Boyd & Vandenberghe, Convex Optimization, 2004.

27 Now Back to SVM Problem

28 Objective of an SVM SVM objective: finding the optimal margin classifier
\min_{w,b} \frac{1}{2}\|w\|^2 \quad s.t. \quad y^{(i)}(w^\top x^{(i)} + b) \ge 1, \ i = 1,\dots,m
Re-write the constraints as g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0
so as to match the standard optimization form \min_w f(w) \ s.t. \ g_i(w) \le 0, \ i=1,\dots,k; \ h_i(w) = 0, \ i=1,\dots,l

29 Equality Cases g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 = 0
\Rightarrow y^{(i)} \left( \left(\frac{w}{\|w\|}\right)^\top x^{(i)} + \frac{b}{\|w\|} \right) = \frac{1}{\|w\|} (the geometric margin)
The g_i = 0 cases correspond to the training examples whose functional margin is exactly equal to 1.

30 Objective of an SVM SVM objective: finding the optimal margin classifier
\min_{w,b} \frac{1}{2}\|w\|^2 \quad s.t. \quad -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0, \ i = 1,\dots,m
Lagrangian: L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^\top x^{(i)} + b) - 1 \right]
There are no \beta's or equality constraints in the SVM problem.

31 Solving L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^\top x^{(i)} + b) - 1 \right]
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \Rightarrow w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}
\frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0
Then the Lagrangian is re-written as
L(w, b, \alpha) = \frac{1}{2} \left\| \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^\top x^{(i)} + b) - 1 \right] = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)} - b \sum_{i=1}^{m} \alpha_i y^{(i)}
where the last term equals 0.

32 Solving \alpha^* Dual problem: \max_{\alpha \ge 0} \theta_D(\alpha) = \max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha)
\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)}
s.t. \alpha_i \ge 0, \ i = 1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
\alpha^* is solved with some method, e.g. SMO. We will get back to this solution later.

33 Solving w^* and b^* With \alpha^* solved, w^* is obtained by w^* = \sum_{i=1}^{m} \alpha_i^* y^{(i)} x^{(i)}
Only the support vectors have \alpha_i > 0.
With w^* solved, b^* is obtained by b^* = -\frac{\max_{i: y^{(i)} = -1} w^{*\top} x^{(i)} + \min_{i: y^{(i)} = 1} w^{*\top} x^{(i)}}{2}

34 Predicting Values With the solutions of w^* and b^*, the predicted value (i.e. functional margin) of each instance is
w^{*\top} x + b^* = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^\top x + b^* = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b^*
We only need to calculate the inner product of x with the support vectors.
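A minimal sketch (NumPy; alpha is assumed to come from a dual solver such as the SMO procedure introduced later, and the helper names are illustrative) of recovering w*, b* and predicting through inner products with the support vectors only:

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    """w* = sum_i alpha_i y_i x_i and
    b* = -(max_{y=-1} w^T x + min_{y=+1} w^T x) / 2, as on the slides."""
    w = (alpha * y) @ X
    b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2
    support = alpha > tol          # only support vectors have alpha_i > 0
    return w, b, support

def predict_margin(x, alpha, X, y, b, support):
    """Functional margin computed via inner products with support vectors only."""
    return sum(alpha[i] * y[i] * (X[i] @ x) for i in np.where(support)[0]) + b
```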

35 Non-Separable Cases The derivation of the SVM as presented so far assumes that the data is linearly separable. More practical cases are linearly non-separable.

36 Dealing with Non-Separable Cases Add slack variables \xi_i:
\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad (L1 regularization of the slacks)
s.t. y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i, \ i=1,\dots,m; \quad \xi_i \ge 0, \ i=1,\dots,m
Lagrangian: L(w, b, \xi, \alpha, r) = \frac{1}{2} w^\top w + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(x^{(i)\top} w + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i
Dual problem: \max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)}
s.t. 0 \le \alpha_i \le C, \ i=1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
Surprisingly, the box constraint 0 \le \alpha_i \le C is the only change. The dual is efficiently solved by the SMO algorithm.
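In practice the soft-margin dual is rarely solved by hand. A library sketch (assuming scikit-learn, with a deliberately non-separable toy set) that exposes the same C parameter:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that is NOT linearly separable (one label is flipped on purpose).
X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.0], [0.5, -0.5], [1.8, 1.8]])
y = np.array([1, 1, -1, -1, -1])

# C trades margin width against slack: a small C tolerates more violations.
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # w and b of the soft-margin separator
print(clf.support_)                # indices of the support vectors
print(clf.predict([[1.0, 1.0]]))
```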

37 SVM Hinge Loss vs. LR Loss SVM hinge loss: \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \max(0, 1 - y_i(w^\top x_i + b))
LR log loss: -y_i \log \sigma(w^\top x_i + b) - (1 - y_i) \log(1 - \sigma(w^\top x_i + b))
(Figure: the two losses plotted against the score for y = 1.)
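A small sketch (NumPy; toy inputs assumed) that makes the two losses concrete. Note the label conventions differ: {-1, +1} for the hinge loss and {0, 1} for the log loss.

```python
import numpy as np

def hinge_objective(w, b, X, y, C=1.0):
    """SVM objective: 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)),
    with labels y_i in {-1, +1}."""
    margins = y * (X @ w + b)
    return 0.5 * w @ w + C * np.sum(np.maximum(0.0, 1.0 - margins))

def log_loss(w, b, X, y01):
    """LR log loss: -sum_i [y_i log s_i + (1 - y_i) log(1 - s_i)],
    with labels y_i in {0, 1} and s_i = sigmoid(w^T x_i + b)."""
    s = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return -np.sum(y01 * np.log(s) + (1 - y01) * np.log(1 - s))
```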

38 Coordinate Ascent (Descent) For the optimization problem \max_\alpha W(\alpha_1, \alpha_2, \dots, \alpha_m)
Coordinate ascent algorithm:
Loop until convergence: for i = 1,\dots,m: \alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \dots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \dots, \alpha_m)
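A minimal sketch of this loop (NumPy; the exact inner arg-max is replaced here by a 1-D grid search, which is an illustrative simplification, not the slide's method):

```python
import numpy as np

def coordinate_ascent(W, alpha0, grid, n_iters=50):
    """Cycle over coordinates; for each one, pick the grid value that
    maximizes W with all other coordinates held fixed."""
    alpha = alpha0.copy()
    for _ in range(n_iters):
        for i in range(len(alpha)):
            alpha[i] = max(grid, key=lambda v: W(
                np.concatenate([alpha[:i], [v], alpha[i + 1:]])))
    return alpha

# Usage on a concave toy objective W(a) = -(a1 - 1)^2 - (a2 - 2)^2.
W = lambda a: -(a[0] - 1.0) ** 2 - (a[1] - 2.0) ** 2
print(coordinate_ascent(W, np.zeros(2), grid=np.linspace(0, 3, 301)))
```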

39 Coordinate Ascent (Descent) A two-dimensional coordinate ascent example

40 SMO Algorithm SMO: sequential minimal optimization. SVM optimization problem:
\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)}
s.t. 0 \le \alpha_i \le C, \ i=1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
We cannot directly apply the coordinate ascent algorithm, because \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 \Rightarrow \alpha_i y^{(i)} = -\sum_{j \ne i} \alpha_j y^{(j)}, so no single \alpha_i can change while all the others are held fixed.

41 SMO Algorithm Update two variables each time:
Loop until convergence { 1. Select some pair \alpha_i and \alpha_j to update next. 2. Re-optimize W(\alpha) w.r.t. \alpha_i and \alpha_j. }
Convergence test: whether the change of W(\alpha) is smaller than a predefined value (e.g. 0.01)
The key advantage of the SMO algorithm is that the update of \alpha_i and \alpha_j (step 2) is efficient.

42 SMO Algorithm \max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)}
s.t. 0 \le \alpha_i \le C, \ i=1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
Without loss of generality, hold \alpha_3, \dots, \alpha_m fixed and optimize W(\alpha) w.r.t. \alpha_1 and \alpha_2:
\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta \Rightarrow \alpha_1 = (\zeta - \alpha_2 y^{(2)})\, y^{(1)}

43 SMO Algorithm With \alpha_1 = (\zeta - \alpha_2 y^{(2)})\, y^{(1)}, the objective is written as
W(\alpha_1, \alpha_2, \dots, \alpha_m) = W((\zeta - \alpha_2 y^{(2)})\, y^{(1)}, \alpha_2, \dots, \alpha_m)
Thus the original optimization problem
\max_\alpha W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j x^{(i)\top} x^{(j)} \quad s.t. \quad 0 \le \alpha_i \le C, \ i=1,\dots,m; \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0
is transformed into a quadratic optimization problem w.r.t. \alpha_2:
\max_{\alpha_2} W(\alpha_2) = a \alpha_2^2 + b \alpha_2 + c \quad s.t. \quad 0 \le \alpha_2 \le C

44 SMO Algorithm Optimizing a one-dimensional quadratic function is very efficient:
\max_{\alpha_2} W(\alpha_2) = a \alpha_2^2 + b \alpha_2 + c \quad s.t. \quad 0 \le \alpha_2 \le C
(Figure: three cases, the unconstrained maximizer -b/(2a) lies inside [0, C], below 0, or above C; in the latter two cases the solution is clipped to the boundary.)
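This is the step that makes SMO cheap: each inner update is a clipped one-dimensional quadratic maximization. A small sketch (the coefficients a, b, c are assumed to have been computed from the fixed alphas):

```python
def maximize_clipped_quadratic(a, b, c, C):
    """Maximize W(a2) = a*a2^2 + b*a2 + c over the box 0 <= a2 <= C.
    If the parabola opens downward (a < 0) the unconstrained maximizer is
    -b / (2a), clipped into [0, C]; otherwise the maximum is at an endpoint."""
    candidates = [0.0, C]
    if a < 0:
        stationary = -b / (2.0 * a)
        candidates.append(min(max(stationary, 0.0), C))  # clip into [0, C]
    return max(candidates, key=lambda a2: a * a2 ** 2 + b * a2 + c)

# Example: W(a2) = -a2^2 + 3*a2 with C = 1 -> the clipped maximizer is a2 = 1.
print(maximize_clipped_quadratic(-1.0, 3.0, 0.0, C=1.0))
```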

45 Content of This Lecture Support Vector Machines Neural Networks

46 Breaking News of AI in 2016 AlphaGo wins Lee Sedol (4-1)

47 Machine Learning in AlphaGo Policy Network Supervised Learning Predict what is the best next human move Reinforcement Learning Learning to select the next move to maximize the winning rate Value Network Expectation of winning given the board state Implemented by (deep) neural networks

48 Neural Networks Neural networks are the basis of deep learning Perceptron Multi-layer Perceptron Convolutional Neural Network Recurrent Neural Network

49 Real Neurons Cell structures Cell body Dendrites Axon Synaptic terminals Slides credit: Ray Mooney

50 Neural Communication Electrical potential across cell membrane exhibits spikes called action potentials. Spike originates in cell body, travels down axon, and causes synaptic terminals to release neurotransmitters. Chemical diffuses across synapse to dendrites of other neurons. Neurotransmitters can be excitatory or inhibitory. If net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, it fires an action potential. Slides credit: Ray Mooney

51 Real Neural Learning Synapses change size and strength with experience. Hebbian learning: When two connected neurons are firing at the same time, the strength of the synapse between them increases. Neurons that fire together, wire together. These motivate the research of artificial neural nets Slides credit: Ray Mooney

52 Brief History of Artificial Neural Nets The first wave: 1943 McCulloch and Pitts proposed the McCulloch-Pitts neuron model; 1958 Rosenblatt introduced the simple single-layer networks now called perceptrons; 1969 Minsky and Papert's book Perceptrons demonstrated the limitation of single-layer perceptrons, and almost the whole field went into hibernation. The second wave: 1986 the back-propagation learning algorithm for multi-layer perceptrons was rediscovered and the whole field took off again. The third wave: 2006 deep (neural network) learning gained popularity, and in 2012 it made significant breakthroughs in many applications. Slides credit: Jun Wang

53 Artificial Neuron Model Model the network as a graph with cells as nodes and synaptic connections as weighted edges w_{ji} from node i to node j.
Model the net input to cell j as net_j = \sum_i w_{ji} o_i
Cell output: o_j = 0 if net_j < T_j, and 1 if net_j \ge T_j (T_j is the threshold for unit j)
McCulloch and Pitts [1943] Slides credit: Ray Mooney

54 Perceptron Model Rosenblatt's single-layer perceptron [1958]. Rosenblatt [1958] further proposed the perceptron as the first model for learning with a teacher (i.e., supervised learning).
Prediction: \hat{y} = \varphi\left( \sum_{i=1}^{m} w_i x_i + b \right)
Activation function: \varphi(z) = 1 if z \ge 0, and -1 otherwise (y = 1: class one, y = -1: class two)
The focus was on how to find appropriate weights w_i for the two-class classification task.

55 Training Perceptron Rosenblatt's single-layer perceptron [1958].
Training: w_i = w_i + \eta (y - \hat{y}) x_i, \quad b = b + \eta (y - \hat{y})
Prediction: \hat{y} = \varphi\left( \sum_{i=1}^{m} w_i x_i + b \right), \quad \varphi(z) = 1 if z \ge 0, and -1 otherwise
Equivalent to the rules: if the output is correct, do nothing; if the output is high, lower the weights on positive inputs; if the output is low, increase the weights on active inputs.
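A minimal sketch of this training rule (NumPy, toy separable data, labels in {-1, +1}; the helper name is illustrative):

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Rosenblatt's rule: w_i <- w_i + lr*(y - y_hat)*x_i, b <- b + lr*(y - y_hat)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if (w @ x_i + b) >= 0 else -1.0
            w += lr * (y_i - y_hat) * x_i
            b += lr * (y_i - y_hat)
    return w, b

# Linearly separable toy problem.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))   # should reproduce y
```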

56 Properties of Perceptron Rosenblatt's single-layer perceptron [1958]. Rosenblatt proved the convergence of the learning algorithm if the two classes are linearly separable (i.e., the patterns lie on opposite sides of a hyperplane w_1 x_1 + w_2 x_2 + b = 0).
Many people hoped that such a machine could be the basis for artificial intelligence.

57 Properties of Perceptron The XOR problem:
Input (X_1, X_2) -> Output X_1 XOR X_2: (0, 0) -> 0; (0, 1) -> 1; (1, 0) -> 1; (1, 1) -> 0
XOR is non-linearly separable: these two classes (true and false) cannot be separated using a line.
Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, could not be done by Rosenblatt's one-layer perceptron.
Rosenblatt believed the limitations could be overcome if more layers of units were added, but no learning algorithm to obtain the weights was known yet.
Due to the lack of learning algorithms, people left the neural network paradigm for almost 20 years.

58 Hidden Layers and Backpropagation (1986~) Adding hidden layer(s) (an internal representation) allows the network to learn a mapping that is not constrained to be linearly separable.
(Figure: a single unit with inputs x_1, x_2, weights w_1, w_2 and bias b realizes the decision boundary x_1 w_1 + x_2 w_2 + b = 0; each hidden node realizes one of the lines bounding the convex region separating class 1 from class 2.)

59 Hidden Layers and Backpropagation (1986~) But the solution is quite often not unique.
(Figure: the XOR truth table with two different two-layer solutions, solution 1 and solution 2; two lines are necessary to divide the sample space accordingly. Sign activation function; the number in each circle is a threshold.)

60 Hidden Layers and Backpropagation (1986~) Feedforward: messages move forward from the input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles or loops in the network. (Figure: a two-layer feedforward neural network with two sets of weight parameters.)

61 Single / Multiple Layers of Calculation Single-layer functions: f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2, \quad f_\theta(x) = \sigma(\theta_0 + \theta_1 x + \theta_2 x^2)
Multiple-layer function: h_1(x) = \tanh(\theta_0 + \theta_1 x + \theta_2 x^2), \quad h_2(x) = \tanh(\theta_3 + \theta_4 x + \theta_5 x^2)
f_\theta(x) = f_\theta(h_1(x), h_2(x)) = \sigma(\theta_6 + \theta_7 h_1 + \theta_8 h_2)
With non-linear activation functions \sigma(x) = \frac{1}{1+e^{-x}} and \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}
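A direct transcription of the multi-layer function on this slide (NumPy; the parameter values below are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_f(x, theta):
    """f_theta(x) = sigmoid(theta6 + theta7*h1 + theta8*h2) with
    h1 = tanh(theta0 + theta1*x + theta2*x^2),
    h2 = tanh(theta3 + theta4*x + theta5*x^2)."""
    h1 = np.tanh(theta[0] + theta[1] * x + theta[2] * x ** 2)
    h2 = np.tanh(theta[3] + theta[4] * x + theta[5] * x ** 2)
    return sigmoid(theta[6] + theta[7] * h1 + theta[8] * h2)

# Usage with illustrative parameter values.
theta = np.array([0.1, -0.5, 0.3, 0.2, 0.8, -0.1, 0.0, 1.5, -2.0])
print(two_layer_f(1.0, theta))
```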

62 Non-linear Activation Functions Sigmoid: \sigma(z) = \frac{1}{1+e^{-z}} \quad Tanh: \tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}} \quad Rectified Linear Unit (ReLU): ReLU(z) = \max(0, z)

63 Universal Approximation Theorem A feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron) can approximate continuous functions on compact subsets of \mathbb{R}^n, under mild assumptions on the activation function (such as sigmoid, tanh and ReLU). [Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989).]

64 Universal Approximation A multi-layer perceptron can approximate any continuous function on a compact subset of \mathbb{R}^n. \sigma(x) = \frac{1}{1+e^{-x}}, \quad \tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}

65 Hidden Layers and Backpropagation (1986~) One of the efficient algorithms for multi-layer neural networks is the backpropagation algorithm. It was re-introduced in 1986 and neural networks regained their popularity. (Figure: the error is calculated at the output and back-propagated through the weight parameters of each layer.) Note: backpropagation appears to have been first found by Werbos [1974], and then independently rediscovered around 1985 by Rumelhart, Hinton, and Williams [1986] and by Parker [1985].

66 Learning NN by Back-Propagation Compare the outputs y_k with the correct answer to get the error derivatives at the output units. [LeCun, Bengio and Hinton. Deep Learning. Nature 2015.]

67 Learning NN by Back-Propagation Error back-propagation. (Figure: a network with inputs x_1, x_2, ..., x_m and outputs y_1, y_0; the weight parameters of each layer are adjusted from the error calculated at the output. Training instances: an image labeled "Face" with target d_1 = 1 and an image labeled "no face" with target d_2 = 0.)

68 Make a Prediction A two-layer feedforward neural network: inputs x = (x_1, \dots, x_m), a hidden layer with units h_j^{(1)}, an output layer with outputs y_k, and labels d_k.
Feed-forward prediction:
h_j^{(1)} = f^{(1)}(net_j^{(1)}) = f^{(1)}\left( \sum_m w_{j,m}^{(1)} x_m \right), \quad y_k = f^{(2)}(net_k^{(2)}) = f^{(2)}\left( \sum_j w_{k,j}^{(2)} h_j^{(1)} \right)
where net_j^{(1)} = \sum_m w_{j,m}^{(1)} x_m and net_k^{(2)} = \sum_j w_{k,j}^{(2)} h_j^{(1)}


71 When Backprop / Learn Parameters For the two-layer feedforward network (inputs x_m, hidden units h_j^{(1)}, outputs y_k, labels d_k), define the squared error E(W) = \frac{1}{2} \sum_k (y_k - d_k)^2.
Backprop to learn the output-layer parameters w_{k,j}^{(2)}:
\frac{\partial E}{\partial w_{k,j}^{(2)}} = (y_k - d_k)\, f^{(2)\prime}(net_k^{(2)})\, h_j^{(1)} = -\delta_k h_j^{(1)}, \quad where \ \delta_k = (d_k - y_k)\, f^{(2)\prime}(net_k^{(2)})
\Delta w_{k,j}^{(2)} = \eta \cdot Error_k \cdot Output_j = \eta\, \delta_k h_j^{(1)}, \quad w_{k,j}^{(2)} = w_{k,j}^{(2)} + \Delta w_{k,j}^{(2)}
Notations: net_j^{(1)} = \sum_m w_{j,m}^{(1)} x_m, \quad net_k^{(2)} = \sum_j w_{k,j}^{(2)} h_j^{(1)}

72 When Backprop / Learn Parameters Backprop to learn the hidden-layer parameters w_{j,m}^{(1)}:
\frac{\partial E}{\partial w_{j,m}^{(1)}} = \sum_k (y_k - d_k)\, f^{(2)\prime}(net_k^{(2)})\, w_{k,j}^{(2)}\, f^{(1)\prime}(net_j^{(1)})\, x_m = -\delta_j x_m, \quad where \ \delta_j = f^{(1)\prime}(net_j^{(1)}) \sum_k \delta_k w_{k,j}^{(2)}
\Delta w_{j,m}^{(1)} = \eta \cdot Error_j \cdot Output_m = \eta\, \delta_j x_m, \quad w_{j,m}^{(1)} = w_{j,m}^{(1)} + \Delta w_{j,m}^{(1)}
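A compact sketch that ties the two update rules together (NumPy; sigmoid activations, squared error, biases omitted for brevity; the network shape and values are toy assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, lr=0.5):
    """One forward/backward pass for a two-layer sigmoid network under
    E = 0.5 * sum_k (y_k - d_k)^2, following the slide's rules:
    delta_k = (d_k - y_k) f'(net_k),  delta_j = f'(net_j) * sum_k delta_k W2[k, j]."""
    # Forward pass
    h = sigmoid(W1 @ x)                 # hidden activations h_j
    y = sigmoid(W2 @ h)                 # outputs y_k
    # Backward pass (sigmoid derivative is f * (1 - f))
    delta_k = (d - y) * y * (1 - y)
    delta_j = h * (1 - h) * (W2.T @ delta_k)
    # Gradient updates: Delta w = lr * delta * (input of that layer)
    W2 += lr * np.outer(delta_k, h)
    W1 += lr * np.outer(delta_j, x)
    return y, W1, W2

# Toy usage: 3 inputs -> 2 hidden units -> 1 output, target d = 0.9.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=(1, 2))
x, d = np.array([1.0, 0.5, -0.2]), np.array([0.9])
for _ in range(100):
    y, W1, W2 = backprop_step(x, d, W1, W2)
print(y)   # should have moved towards 0.9
```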

73 An example for Backprop

74 An Example for Backprop Output layer: \delta_k = (d_k - y_k)\, f^{(2)\prime}(net_k^{(2)}), \quad \Delta w_{k,j}^{(2)} = \eta \cdot Error_k \cdot Output_j = \eta\, \delta_k h_j^{(1)}
Hidden layer: \delta_j = f^{(1)\prime}(net_j^{(1)}) \sum_k \delta_k w_{k,j}^{(2)}, \quad \Delta w_{j,m}^{(1)} = \eta \cdot Error_j \cdot Output_m = \eta\, \delta_j x_m
Consider the sigmoid activation function f_{Sigmoid}(x) = \frac{1}{1+e^{-x}}, whose derivative is f'_{Sigmoid}(x) = f_{Sigmoid}(x)(1 - f_{Sigmoid}(x))

75 Let us do some calculation Consider the simple network below, and assume that the neurons have a sigmoid activation function: 1. Perform a forward pass on the network. 2. Perform a reverse pass (training) once (target = 0.5). 3. Perform a further forward pass and comment on the result.

76 Let us do some calculation

77 A demo from Google

78 Non-linear Activation Functions Sigmoid: \sigma(z) = \frac{1}{1+e^{-z}} \quad Tanh: \tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}} \quad Rectified Linear Unit (ReLU): ReLU(z) = \max(0, z)

79 Activation Functions (Figure: plots of f_{linear}(x), f_{Sigmoid}(x), f_{tanh}(x) and their derivatives f'_{linear}(x), f'_{Sigmoid}(x), f'_{tanh}(x).)

80 Activation Functions Logistic sigmoid: f_{Sigmoid}(x) = \frac{1}{1+e^{-x}}, with derivative f'_{Sigmoid}(x) = f_{Sigmoid}(x)(1 - f_{Sigmoid}(x))
Output range [0, 1]. Motivated by biological neurons; the output can be interpreted as the probability of an artificial neuron firing given its inputs.
However, saturated neurons make gradients vanish (why?).

81 Activation Functions Tanh function: f_{tanh}(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}, with gradient f'_{tanh}(x) = 1 - f_{tanh}(x)^2
Output range [-1, 1]. Thus strongly negative inputs to the tanh map to negative outputs, and only zero-valued inputs are mapped to near-zero outputs. These properties make the network less likely to get stuck during training.

82 Activation Functions ReLU (rectified linear unit): f_{ReLU}(x) = \max(0, x), with derivative f'_{ReLU}(x) = 1 if x > 0, and 0 if x \le 0
Another version is the noisy ReLU: f_{NoisyReLU}(x) = \max(0, x + N(0, \delta(x)))
ReLU can be approximated by the softplus function f_{Softplus}(x) = \log(1 + e^x) (google.com/en//pubs/archive/40811.pdf)
The ReLU gradient doesn't vanish as x increases; it can be used to model positive numbers; it is fast, since there is no need to compute the exponential function; and it eliminates the necessity of a pretraining phase.
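A tiny sketch of these three functions (NumPy; inputs are assumed to be arrays or scalars):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return np.where(np.asarray(x) > 0, 1.0, 0.0)   # 1 if x > 0, else 0

def softplus(x):
    return np.log1p(np.exp(x))                      # log(1 + e^x), smooth ReLU

print(relu(-2.0), relu(3.0), relu_grad(3.0), softplus(0.0))
```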

83 Activation Functions ReLU (rectified linear unit): f_{ReLU}(x) = \max(0, x); it can be approximated by the softplus function f_{Softplus}(x) = \log(1 + e^x)
The only non-linearity comes from the path selection, with individual neurons being active or not. It allows sparse representations: for a given input, only a subset of neurons are active (sparse propagation of activations and gradients).
Additional activation functions: Leaky ReLU, Exponential LU, Maxout, etc.

84 Error/Loss Function Recall stochastic gradient descent: update w from a randomly picked example (but in practice do a batch update).
Squared error loss for one binary output: L(w) = \frac{1}{2} (y - f_w(x))^2, where x is the input and f_w(x) the output.

85 Error/Loss Function Softmax (cross-entropy loss) for multiple classes (class labels follow a multinomial distribution):
L(w) = -\sum_k \left( d_k \log \hat{y}_k + (1 - d_k) \log(1 - \hat{y}_k) \right)
where \hat{y}_k = \frac{\exp\left( \sum_j w_{k,j}^{(2)} h_j^{(1)} \right)}{\sum_{k'} \exp\left( \sum_j w_{k',j}^{(2)} h_j^{(1)} \right)} and the d_k are one-hot encoded class labels.
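A minimal sketch (NumPy) of the softmax output and the -\sum_k d_k \log \hat{y}_k term of this loss for a one-hot label; the complementary (1 - d_k) term on the slide is omitted here for simplicity:

```python
import numpy as np

def softmax(net):
    """y_hat_k = exp(net_k) / sum_k' exp(net_k'), with the usual max-shift
    for numerical stability."""
    e = np.exp(net - np.max(net))
    return e / np.sum(e)

def cross_entropy(d, y_hat):
    """-sum_k d_k log y_hat_k for a one-hot encoded label vector d."""
    return -np.sum(d * np.log(y_hat + 1e-12))

# Usage: three classes, the true class is the second one.
net = np.array([1.0, 2.0, -0.5])     # output-layer pre-activations
d = np.array([0.0, 1.0, 0.0])        # one-hot label
print(cross_entropy(d, softmax(net)))
```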

86 For more machine learning details, you can check out my machine learning course at Zhiyuan College CS420 Machine Learning Weinan Zhang Course webpage:
