Perceptron & Neural Networks


1 School of Computer Science Introduction to Machine Learning Perceptron & Neural Networks Readings: Bishop Ch. , Ch. 5; Murphy Ch. 16.5, Ch. 28; Mitchell Ch. 4 Matt Gormley Lecture 12 October 17, 2016

2 Reminders Homework 3: due 10/24/16 2

3 Outline Discriminative vs. Generative Perceptron Neural Networks Backpropagation 3

4 DISCRIMINATIVE AND GENERATIVE CLASSIFIERS 4

5 Generative vs. Discriminative Generative Classifiers: Example: Naïve Bayes Define a joint model of the observations x and the labels y: p(x, y) Learning maximizes (joint) likelihood Use Bayes Rule to classify based on the posterior: p(y|x) = p(x|y) p(y) / p(x) Discriminative Classifiers: Example: Logistic Regression Directly model the conditional: p(y|x) Learning maximizes conditional likelihood 5

6 Generative vs. Discriminative Finite Sample Analysis (Ng & Jordan, 2002) [Assume that we are learning from a finite training dataset] If model assumptions are correct: Naïve Bayes is a more efficient learner (requires fewer samples) than Logistic Regression If model assumptions are incorrect: Logistic Regression has lower asymptotic error, and does better than Naïve Bayes 6

7 solid: NB dashed: LR Slide courtesy of William Cohen 7

8 solid: NB dashed: LR Naïve Bayes makes stronger assumptions about the data but needs fewer examples to estimate the parameters. On Discriminative vs. Generative Classifiers, Andrew Ng and Michael Jordan, NIPS 2002. Slide courtesy of William Cohen 8

9 Generative vs. Discriminative Learning (Parameter Estimation) Naïve Bayes: Parameters are decoupled → Closed form solution for MLE Logistic Regression: Parameters are coupled → No closed form solution; must use iterative optimization techniques instead 9

10 Naïve Bayes vs. Logistic Reg. Learning (MAP Estimation of Parameters) Bernoulli Naïve Bayes: Parameters are probabilities → Beta prior (usually) pushes probabilities away from zero / one extremes Logistic Regression: Parameters are not probabilities → Gaussian prior encourages parameters to be close to zero (effectively pushes the probabilities away from zero / one extremes) 10

11 Features Naïve Bayes vs. Logistic Reg. Naïve Bayes: Features x are assumed to be conditionally independent given y. (i.e. Naïve Bayes Assumption) Logistic Regression: No assumptions are made about the form of the features x. They can be dependent and correlated in any fashion. 11

12 THE PERCEPTRON ALGORITHM 12

13 Recall Background: Hyperplanes Why don't we drop the generative model and try to learn this hyperplane directly?

14 Background: Hyperplanes Recall w Hyperplane (Definition 1): H = {x : w^T x = b} Hyperplane (Definition 2): H = {x : w^T x = 0 and x_1 = 1} Half-spaces: H+ = {x : w^T x > 0 and x_1 = 1} H- = {x : w^T x < 0 and x_1 = 1}
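The augmented-feature trick in Definition 2 is easy to see in code. Below is a minimal NumPy sketch (not from the slides; the weights and points are made up) that folds the offset into the weight vector by prepending a constant feature x_1 = 1 and then reads off which half-space each point falls in.

import numpy as np

# Hypothetical illustration of Definition 2: fold the bias into the weight
# vector by prepending a constant feature x_1 = 1 to every input.
w = np.array([-2.0, 1.0, 3.0])   # w[0] plays the role of the (negated) offset b
points = np.array([[4.0, 0.0],   # raw 2-D inputs
                   [0.0, 1.0],
                   [1.0, -1.0]])

# Augment each point with the constant feature 1.
X = np.hstack([np.ones((points.shape[0], 1)), points])

scores = X @ w                    # w^T x for every point
sides = np.sign(scores)           # +1 -> H+, -1 -> H-, 0 -> on the hyperplane
print(scores, sides)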

15 Background: Hyperplanes Directly modeling the hyperplane would use a decision function: h(x) = sign(θ^T x) for y ∈ {-1, +1} Recall Why don't we drop the generative model and try to learn this hyperplane directly?

16 Online Learning Model Setup: We receive an example (x, y) Make a prediction h(x) Check for correctness h(x) = y? Goal: Minimize the number of mistakes 16

17 Margins Recall Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w^T x = 0 (or the negative if on the wrong side) Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S. Definition: The margin γ of a set of examples S is the maximum γ_w over all linear separators w. Slide from Nina Balcan

18 Perceptron Algorithm Data: Inputs are continuous vectors of length K. Outputs are discrete. D = {(x^(i), y^(i))}_{i=1}^N where x ∈ R^K and y ∈ {+1, -1} Prediction: Output determined by hyperplane. ŷ = h_θ(x) = sign(θ^T x), where sign(a) = +1 if a ≥ 0, -1 otherwise Learning: Iterative procedure: while not converged: receive next example (x, y); predict ŷ = h(x); if positive mistake: add x to parameters; if negative mistake: subtract x from parameters 18

19 Perceptron Algorithm Learning: Algorithm 1 Perceptron Learning Algorithm (Batch) 1: procedure PERCEPTRON(D = {(x^(1), y^(1)), ..., (x^(N), y^(N))}) 2: θ ← 0 (Initialize parameters) 3: while not converged do 4: for i ∈ {1, 2, ..., N} do (For each example) 5: ŷ ← sign(θ^T x^(i)) (Predict) 6: if ŷ ≠ y^(i) then (If mistake) 7: θ ← θ + y^(i) x^(i) (Update parameters) 8: return θ 19
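The batch pseudocode above translates almost line for line into NumPy. The following is a minimal sketch, not the course's reference implementation; the function name, the max_epochs cap, and the convergence check are my own additions.

import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron learning on data X (N x K) with labels y in {+1, -1}.

    Assumes a constant feature has already been appended to X if a bias
    term is desired (the x_1 = 1 convention from the hyperplane slides).
    """
    N, K = X.shape
    theta = np.zeros(K)                          # line 2: initialize parameters
    for _ in range(max_epochs):                  # line 3: "while not converged"
        mistakes = 0
        for i in range(N):                       # line 4: for each example
            y_hat = 1 if X[i] @ theta >= 0 else -1   # line 5: predict
            if y_hat != y[i]:                    # line 6: if mistake
                theta += y[i] * X[i]             # line 7: update parameters
                mistakes += 1
        if mistakes == 0:                        # converged: no mistakes this pass
            break
    return theta

On linearly separable data this loop terminates, which is exactly what the mistake bound later in this section guarantees.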

20 Analysis: Perceptron Perceptron Mistake Bound Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then Perceptron makes at most (R/γ)^2 mistakes. (Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.) Slide from Nina Balcan 20

21 Analysis: Perceptron Perceptron Mistake Bound Theorem 0.1 (Block (1962), Novikoff (1962)). Given dataset: D = {(x^(i), y^(i))}_{i=1}^N. Suppose: 1. Finite size inputs: ||x^(i)|| ≤ R 2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i Then: The number of mistakes made by the Perceptron algorithm on this dataset is k ≤ (R/γ)^2 Figure from Nina Balcan 21

22 Analysis: Perceptron Proof of Perceptron Mistake Bound: We will show that there exist constants A and B s.t. Ak ≤ ||θ^(k+1)|| ≤ B√k 22

23 Analysis: Perceptron Theorem 0.1 (Block (1962), Novikoff (1962)). Given dataset: D = {(x^(i), y^(i))}_{i=1}^N. Suppose: 1. Finite size inputs: ||x^(i)|| ≤ R 2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i Then: The number of mistakes made by the Perceptron algorithm on this dataset is k ≤ (R/γ)^2 Algorithm 1 Perceptron Learning Algorithm (Online) 1: procedure PERCEPTRON(D = {(x^(1), y^(1)), (x^(2), y^(2)), ...}) 2: θ ← 0, k = 1 (Initialize parameters) 3: for i ∈ {1, 2, ...} do (For each example) 4: if y^(i) (θ^(k) · x^(i)) ≤ 0 then (If mistake) 5: θ^(k+1) ← θ^(k) + y^(i) x^(i) (Update parameters) 6: k ← k + 1 7: return θ 23

24 Analysis: Perceptron Whiteboard: Proof of Perceptron Mistake Bound 24

25 Analysis: Perceptron Proof of Perceptron Mistake Bound: Part 1: for some A, Ak ≤ ||θ^(k+1)||
θ* · θ^(k+1) = θ* · (θ^(k) + y^(i) x^(i)) by Perceptron algorithm update
= θ* · θ^(k) + y^(i) (θ* · x^(i))
≥ θ* · θ^(k) + γ by assumption
θ* · θ^(k+1) ≥ kγ by induction on k, since θ^(1) = 0
||θ^(k+1)|| ≥ kγ since ||θ*|| = 1 and the Cauchy-Schwarz inequality 25

26 Analysis: Perceptron Proof of Perceptron Mistake Bound: Part 2: for some B, ||θ^(k+1)|| ≤ B√k
||θ^(k+1)||^2 = ||θ^(k) + y^(i) x^(i)||^2 by Perceptron algorithm update
= ||θ^(k)||^2 + (y^(i))^2 ||x^(i)||^2 + 2 y^(i) (θ^(k) · x^(i))
≤ ||θ^(k)||^2 + (y^(i))^2 ||x^(i)||^2 since the kth mistake means y^(i) (θ^(k) · x^(i)) ≤ 0
≤ ||θ^(k)||^2 + R^2 since (y^(i))^2 ||x^(i)||^2 = ||x^(i)||^2 ≤ R^2 by assumption and (y^(i))^2 = 1
||θ^(k+1)||^2 ≤ k R^2 by induction on k, since ||θ^(1)||^2 = 0
||θ^(k+1)|| ≤ √k R 26

27 Analysis: Perceptron Proof of Perceptron Mistake Bound: Part 3: Combining the bounds finishes the proof. kγ ≤ ||θ^(k+1)|| ≤ √k R ⟹ k ≤ (R/γ)^2 The total number of mistakes must be less than this. 27
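To make the bound concrete, here is a small, purely illustrative simulation (the data, margin threshold, and random seed are invented) that runs the online Perceptron until convergence and compares the total mistake count to (R/γ)^2.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative separable data: labels come from a known unit-norm theta*.
theta_star = np.array([0.6, 0.8])                  # ||theta*|| = 1
X = rng.uniform(-1.0, 1.0, size=(500, 2))
X = X[np.abs(X @ theta_star) >= 0.1]               # enforce a margin of at least 0.1
y = np.sign(X @ theta_star)

R = np.max(np.linalg.norm(X, axis=1))              # radius of the enclosing ball
gamma = np.min(y * (X @ theta_star))               # margin achieved by theta*

theta = np.zeros(2)
mistakes = 0
converged = False
while not converged:                               # repeated online passes until no mistakes
    converged = True
    for x_i, y_i in zip(X, y):
        if y_i * (theta @ x_i) <= 0:               # mistake on this example
            theta += y_i * x_i                     # Perceptron update
            mistakes += 1
            converged = False

print(f"mistakes = {mistakes}, bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")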

28 Extensions of Perceptron Kernel Perceptron Choose a kernel K(x, x') Apply the kernel trick to Perceptron Resulting algorithm is still very simple Structured Perceptron Basic idea can also be applied when y ranges over an exponentially large set Mistake bound does not depend on the size of that set 28

29 Summary: Perceptron Perceptron is a simple linear classifier Simple learning algorithm: when a mistake is made, add / subtract the features For linearly separable and inseparable data, we can bound the number of mistakes (geometric argument) Extensions support nonlinear separators and structured prediction 29

30 RECALL: LOGISTIC REGRESSION 30

31 Using gradient ascent for linear classifiers (Recall) Key idea behind today's lecture: 1. Define a linear classifier (logistic regression) 2. Define an objective function (likelihood) 3. Optimize it with gradient descent to learn parameters 4. Predict the class with highest probability under the model 31

32 Using gradient ascent for linear classifiers (Recall) This decision function isn't differentiable: h(x) = sign(θ^T x) Use a differentiable function instead: p_θ(y = 1 | x) = 1 / (1 + exp(-θ^T x)) [Plots: sign(x) and logistic(u) = 1 / (1 + e^(-u))] 32


34 Logistic Regression Recall Data: Inputs are continuous vectors of length K. Outputs are discrete. D = {(x^(i), y^(i))}_{i=1}^N where x ∈ R^K and y ∈ {0, 1} Model: Logistic function applied to dot product of parameters with input vector. p_θ(y = 1 | x) = 1 / (1 + exp(-θ^T x)) Learning: finds the parameters that minimize some objective function. θ* = argmin_θ J(θ) Prediction: Output is the most probable class. ŷ = argmax_{y ∈ {0,1}} p_θ(y | x) 34
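As a quick illustration of the Model and Prediction boxes above, here is a toy sketch in Python (the parameter and input values are made up, and the 0.5 threshold is just the two-class argmax written out):

import numpy as np

def sigmoid(a):
    # logistic(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(theta, x):
    # p_theta(y = 1 | x) for a single input vector x
    return sigmoid(theta @ x)

def predict(theta, x):
    # output the most probable class in {0, 1}
    return int(predict_proba(theta, x) >= 0.5)

theta = np.array([0.5, -1.0, 2.0])       # made-up parameters
x = np.array([1.0, 0.3, 0.8])            # made-up input (first entry = bias feature)
print(predict_proba(theta, x), predict(theta, x))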

35 NEURAL NETWORKS 35

36 Learning highly non-linear functions f: X → Y f might be a non-linear function X (vector of) continuous and/or discrete vars Y (vector of) continuous and/or discrete vars Examples: the XOR gate, speech recognition Eric Xing @ CMU

37 Perceptron and Neural Nets From biological neuron to artificial neuron (perceptron): [Figure: dendrites, soma, axon, and synapses of a biological neuron vs. inputs x_1, x_2 with weights w_1, w_2, a linear combiner, a hard limiter with threshold θ, and output Y] X = Σ_{i=0}^n x_i w_i, Y = +1 if X ≥ θ, -1 if X < θ Artificial neuron networks: supervised learning, gradient descent [Figure: input signals → input layer → middle layer → output layer → output signals] Eric Xing @ CMU

38 Connectionist Models Consider humans: Neuron switching time ~ 0.001 second Number of neurons ~ 10^10 Connections per neuron ~ 10^4-5 Scene recognition time ~ 0.1 second 100 inference steps doesn't seem like enough → much parallel computation Properties of artificial neural nets (ANN): Many neuron-like threshold switching units Many weighted interconnections among units Highly parallel, distributed processes Eric Xing @ CMU

39 Motivation Why is everyone talking about Deep Learning? Because a lot of money is invested in it DeepMind: Acquired by Google for $400 million DNNResearch: Three person startup (including Geoff Hinton) acquired by Google for unknown price tag Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars Because it made the front page of the New York Times 39

40 Motivation Why is everyone talking about Deep Learning? [Timeline figure: 1960s, 1980s, 1990s] Deep learning: Has won numerous pattern recognition competitions Does so with minimal feature engineering This wasn't always the case! Since 1980s: Form of models hasn't changed much, but lots of new tricks: More hidden units Better (online) optimization New nonlinear functions (ReLUs) Faster computers (CPUs and GPUs) 40

41 Background A Recipe for Machine Learning 1. Given training data: Face Face Not a face 2. Choose each of these: Decision function Examples: Linear regression, Logistic regression, Neural Network Loss function Examples: Mean-squared error, Cross-entropy 41

42 Background A Recipe for Machine Learning 1. Given training data: 2. Choose each of these: Decision function, Loss function 3. Define goal: 4. Train with SGD: (take small steps opposite the gradient) 42

43 Background A Recipe for Machine Learning (Gradients) 1. Given training data: 2. Choose each of these: Decision function, Loss function 3. Define goal: 4. Train with SGD: (take small steps opposite the gradient) Callout: Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently! 43

44 Background A Recipe for Machine Learning Goals for Today's Lecture: 1. Explore a new class of decision functions (Neural Networks) 2. Consider variants of this recipe for training 1. Given training data: 2. Choose each of these: Decision function, Loss function 3. Define goal: 4. Train with SGD: (take small steps opposite the gradient) 44

45 Decision Functions Linear Regression y = h_θ(x) = σ(θ^T x) where σ(a) = a [Diagram: Output node connected to Input nodes by weights θ_1, θ_2, θ_3, ..., θ_M] 45

46 Decision Functions Logistic Regression y = h_θ(x) = σ(θ^T x) where σ(a) = 1 / (1 + exp(-a)) [Diagram: Output node connected to Input nodes by weights θ_1, θ_2, θ_3, ..., θ_M] 46

47 Decision Functions Logistic Regression y = h_θ(x) = σ(θ^T x) where σ(a) = 1 / (1 + exp(-a)) [Diagram: Output node over face-image inputs (Face, Face, Not a face) with weights θ_1, θ_2, θ_3, ..., θ_M] 47

48 Decision Functions Logistic Regression y = h_θ(x) = σ(θ^T x) where σ(a) = 1 / (1 + exp(-a)) [Diagram: output y over inputs x_1, x_2 with weights θ_1, θ_2, θ_3, ..., θ_M] 48

49 Decision Functions Logistic Regression y = h_θ(x) = σ(θ^T x) where σ(a) = 1 / (1 + exp(-a)) [Diagram: Output node connected to Input nodes by weights θ_1, θ_2, θ_3, ..., θ_M] 49

50 Neural Network Model [Figure: Inputs (Age = 34, Gender = 2, Stage) feed through weights into a hidden layer of Σ units, then through weights to the output 0.6 = probability of being alive; independent variables → weights → hidden layer → weights → dependent variable (prediction)] Eric Xing @ CMU

51 Combined logistic models [Figure: Inputs (Age = 34, Gender = 2, Stage) feed through weights into one hidden Σ unit and on to the output 0.6 = probability of being alive; independent variables → weights → hidden layer → weights → dependent variable (prediction)] Eric Xing @ CMU

52 [Figure: Inputs (Age = 34, Gender = 2, Stage) → weights → hidden Σ unit → weights → output 0.6 = probability of being alive; independent variables, weights, hidden layer, weights, dependent variable (prediction)] Eric Xing @ CMU

53 [Figure: Inputs (Age = 34, Gender = 1, Stage) → weights → hidden Σ unit → weights → output 0.6 = probability of being alive; independent variables, weights, hidden layer, weights, dependent variable (prediction)] Eric Xing @ CMU

54 Not really, no target for hidden units... [Figure: Inputs (Age = 34, Gender = 2, Stage) → weights → hidden layer (Σ, Σ, Σ) → weights → output 0.6 = probability of being alive; independent variables, weights, hidden layer, weights, dependent variable (prediction)] Eric Xing @ CMU

55 Jargon Pseudo-Correspondence: Independent variable = input variable; Dependent variable = output variable; Coefficients = weights; Estimates = targets. Logistic Regression Model (the sigmoid unit) [Figure: Inputs (Age = 34, Gender = 1, Stage) → Σ → Output 0.6 = probability of being alive; independent variables x1, x2, x3; coefficients a, b, c; dependent variable p (prediction)] Eric Xing @ CMU

56 Decision Functions Neural Network Output Hidden Layer Input 56

57 Decision Functions Neural Network
(E) Output (sigmoid): y = 1 / (1 + exp(-b))
(D) Output (linear): b = Σ_{j=0}^D β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
[Diagram: Input → Hidden Layer → Output] 57
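Read bottom-up, (A) through (E) describe one forward pass. Here is a small NumPy sketch of that computation; the weight names alpha and beta follow the layer equations above, the bias handling via a constant x_0 = 1 and z_0 = 1 is my reading of the i = 0 and j = 0 summation indices, and the sizes and values are illustrative only.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, alpha, beta):
    """One forward pass of the 1-hidden-layer network (A)-(E).

    x:     input vector of length M+1 (x[0] = 1 for the bias term)
    alpha: hidden-layer weights, shape (D, M+1)
    beta:  output weights, length D+1 (beta[0] multiplies a constant 1)
    """
    a = alpha @ x                       # (B) hidden, linear
    z = sigmoid(a)                      # (C) hidden, sigmoid
    z = np.concatenate(([1.0], z))      # prepend z_0 = 1 for the output bias
    b = beta @ z                        # (D) output, linear
    y = sigmoid(b)                      # (E) output, sigmoid
    return y

# Illustrative shapes: M = 3 input features (plus bias), D = 2 hidden units.
rng = np.random.default_rng(0)
x = np.array([1.0, 0.2, -0.4, 0.7])
alpha = rng.normal(size=(2, 4))
beta = rng.normal(size=3)
print(forward(x, alpha, beta))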

58 Building a Neural Net Output Features 58

59 Building a Neural Net Output Hidden Layer D = M Input 59

60 Building a Neural Net Output Hidden Layer D = M Input 60

61 Building a Neural Net Output Hidden Layer D = M Input 61

62 Building a Neural Net Output Hidden Layer D < M Input 62

63 Decision Boundary 0 hidden layers: linear classifier Hyperplanes y x1 x2 Example from Eric Postma via Jason Eisner 63

64 Decision Boundary 1 hidden layer y Boundary of convex region (open or closed) x1 x2 Example from Eric Postma via Jason Eisner 64

65 Decision Boundary 2 hidden layers y Combinations of convex regions x1 x2 Example from Eric Postma via Jason Eisner 65

66 Decision Functions Multi-Class Output [Diagram: Input → Hidden Layer → Output] 66

67 Decision Functions Deeper Networks Next lecture: Output Hidden Layer 1 Input 67

68 Decision Functions Deeper Networks Next lecture: Output Hidden Layer 2 Hidden Layer 1 Input 68

69 Decision Functions Deeper Networks Next lecture: Making the neural networks deeper [Diagram: Input → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output] 69

70 Decision Functions Different Levels of Abstraction We don't know the right levels of abstraction So let the model figure it out! Example from Honglak Lee (NIPS 2010) 70

71 Decision Functions Different Levels of Abstraction Face Recognition: Deep Network can build up increasingly higher levels of abstraction Lines, parts, regions Example from Honglak Lee (NIPS 2010) 71

72 Decision Functions Different Levels of Abstraction Output Hidden Layer 3 Hidden Layer 2 Hidden Layer 1 Input Example from Honglak Lee (NIPS 2010) 72

73 ARCHITECTURES 73

74 Neural Network Architectures Even for a basic Neural Network, there are many design decisions to make: 1. # of hidden layers (depth) 2. # of units per hidden layer (width) 3. Type of activation function (nonlinearity) 4. Form of objective function 74

75 Activation Functions Neural Network with sigmoid activation functions
(F) Loss: J = (1/2)(y - y*)^2
(E) Output (sigmoid): y = 1 / (1 + exp(-b))
(D) Output (linear): b = Σ_{j=0}^D β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i 75

76 Activation Functions Neural Network with arbitrary nonlinear activation functions
(F) Loss: J = (1/2)(y - y*)^2
(E) Output (nonlinear): y = σ(b)
(D) Output (linear): b = Σ_{j=0}^D β_j z_j
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i 76

77 Activation Functions Sigmoid / Logistic Function logistic(u) = 1 / (1 + e^(-u)) So far, we've assumed that the activation function (nonlinearity) is always the sigmoid function 77

78 Activation Functions A new change: modifying the nonlinearity The logistic is not widely used in modern ANNs Alternate 1: tanh Like the logistic function but shifted to range [-1, +1] Slide from William Cohen

79 AI Stats 2010, depth 4: sigmoid vs. tanh Figure from Glorot & Bengio (2010)

80 Activation Functions A new change: modifying the nonlinearity relu often used in vision tasks Alternate 2: rectified linear unit Linear with a cutoff at zero (Implementation: clip the gradient when you pass zero) Slide from William Cohen

81 Activation Functions A new change: modifying the nonlinearity ReLU often used in vision tasks Alternate 2: rectified linear unit Soft version: log(exp(x) + 1) Doesn't saturate (at one end) Sparsifies outputs Helps with vanishing gradient Slide from William Cohen
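For reference, here are the nonlinearities discussed on the last few slides (logistic, tanh, ReLU, and the softplus "soft version" of the ReLU) as short NumPy functions. This is an illustrative sketch, not code from the course.

import numpy as np

def logistic(u):
    # sigmoid: squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-u))

def tanh(u):
    # like the logistic but shifted/scaled to (-1, +1)
    return np.tanh(u)

def relu(u):
    # rectified linear unit: linear with a cutoff at zero
    return np.maximum(0.0, u)

def softplus(u):
    # "soft" version of the ReLU: log(exp(u) + 1)
    return np.log1p(np.exp(u))

u = np.linspace(-3, 3, 7)
for f in (logistic, tanh, relu, softplus):
    print(f.__name__, np.round(f(u), 3))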

82 Objective Functions for NNs Regression: Use the same objective as Linear Regression Quadratic loss (i.e. mean squared error) Classification: Use the same objective as Logistic Regression Cross-entropy (i.e. negative log likelihood) This requires probabilities, so we add an additional softmax layer at the end of our network
Quadratic: Forward J = (1/2)(y - y*)^2, Backward dJ/dy = y - y*
Cross Entropy: Forward J = y* log(y) + (1 - y*) log(1 - y), Backward dJ/dy = y* (1/y) + (1 - y*) (1/(y - 1)) 82
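A small sketch of the Forward and Backward columns of that table (illustrative only; y is the network output and y_star the target, mirroring the slide's notation and sign conventions):

import numpy as np

def quadratic(y, y_star):
    # forward: J = 1/2 (y - y*)^2 ; backward: dJ/dy = y - y*
    J = 0.5 * (y - y_star) ** 2
    dJ_dy = y - y_star
    return J, dJ_dy

def cross_entropy(y, y_star):
    # forward: J = y* log(y) + (1 - y*) log(1 - y)
    # backward: dJ/dy = y*/y + (1 - y*)/(y - 1)
    J = y_star * np.log(y) + (1.0 - y_star) * np.log(1.0 - y)
    dJ_dy = y_star / y + (1.0 - y_star) / (y - 1.0)
    return J, dJ_dy

print(quadratic(0.8, 1.0))       # e.g. output 0.8, target 1.0
print(cross_entropy(0.8, 1.0))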

83 Multi-Class Output [Diagram: Input → Hidden Layer → Output] 83

84 Multi-Class Output Softmax: y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)
(F) Loss: J = Σ_{k=1}^K y*_k log(y_k)
(E) Output (softmax): y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)
(D) Output (linear): b_k = Σ_{j=0}^D β_{kj} z_j, ∀k
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i 84
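The softmax layer and the multi-class loss above are only a few lines of NumPy. This is an illustrative sketch; the max-subtraction step is a standard numerical-stability detail I have added, not something shown on the slide.

import numpy as np

def softmax(b):
    # y_k = exp(b_k) / sum_l exp(b_l); subtracting max(b) avoids overflow
    e = np.exp(b - np.max(b))
    return e / np.sum(e)

def multiclass_loss(y, y_star):
    # J = sum_k y*_k log(y_k), with y* a one-hot target vector
    return np.sum(y_star * np.log(y))

b = np.array([2.0, 0.5, -1.0])          # output-layer pre-activations b_k
y = softmax(b)
y_star = np.array([1.0, 0.0, 0.0])      # one-hot target
print(y, multiclass_loss(y, y_star))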

85 Cross-entropy vs. Quadratic loss Figure from Glorot & Bengio (2010)

86 Background A Recipe for Machine Learning 1. Given training data: 2. Choose each of these: Decision function, Loss function 3. Define goal: 4. Train with SGD: (take small steps opposite the gradient) 86

87 Objective Functions Matching Quiz: Suppose you are given a neural net with a single output, y, and one hidden layer. 1) Minimizing sum of squared errors 2) Minimizing sum of squared errors plus squared Euclidean norm of weights 3) Minimizing cross-entropy 4) Minimizing hinge loss ...gives... 5) MLE estimates of weights assuming target follows a Bernoulli with parameter given by the output value 6) MAP estimates of weights assuming weight priors are zero mean Gaussian 7) estimates with a large margin on the training data 8) MLE estimates of weights assuming zero mean Gaussian noise on the output value A. 1=5, 2=7, 3=6, 4=8 B. 1=5, 2=7, 3=8, 4=6 C. 1=7, 2=5, 3=5, 4=7 D. 1=7, 2=5, 3=6, 4=8 E. 1=8, 2=6, 3=5, 4=7 F. 1=8, 2=6, 3=8, 4=6 87

88 BACKPROPAGATION 88

89 Background A Recipe for Machine Learning 1. Given training data: 2. Choose each of these: Decision function, Loss function 3. Define goal: 4. Train with SGD: (take small steps opposite the gradient) 89

90 Training Backpropagation Question 1: When can we compute the gradients of the parameters of an arbitrary neural network? Question 2: When can we make the gradient computation efficient? 90

91 Training Chain Rule Given: y = g(u) and u = h(x). Chain Rule: dy_i/dx_k = Σ_{j=1}^J (dy_i/du_j)(du_j/dx_k), ∀i, k 91

92 Training Chain Rule Given: y = g(u) and u = h(x). Chain Rule: dy_i/dx_k = Σ_{j=1}^J (dy_i/du_j)(du_j/dx_k), ∀i, k Backpropagation is just repeated application of the chain rule from Calculus

93 Training Chain Rule Given: y = g(u) and u = h(x). Chain Rule: dy_i/dx_k = Σ_{j=1}^J (dy_i/du_j)(du_j/dx_k), ∀i, k Backpropagation: 1. Instantiate the computation as a directed acyclic graph, where each intermediate quantity is a node 2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the goal with respect to that node's intermediate quantity. 3. Initialize all partial derivatives to 0. 4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents This algorithm is also called automatic differentiation in the reverse mode 93

94 Training Backpropagation Simple Example: The goal is to compute J = cos(sin(x^2) + 3x^2) on the forward pass and the derivative dJ/dx on the backward pass.
Forward: J = cos(u), u = u_1 + u_2, u_1 = sin(t), u_2 = 3t, t = x^2 94

95 Training Backpropagation Simple Example: The goal is to compute J = cos(sin(x^2) + 3x^2) on the forward pass and the derivative dJ/dx on the backward pass.
Forward: J = cos(u), u = u_1 + u_2, u_1 = sin(t), u_2 = 3t, t = x^2
Backward: dJ/du = -sin(u)
dJ/du_1 = (dJ/du)(du/du_1), du/du_1 = 1
dJ/du_2 = (dJ/du)(du/du_2), du/du_2 = 1
dJ/dt = (dJ/du_1)(du_1/dt) + (dJ/du_2)(du_2/dt), du_1/dt = cos(t), du_2/dt = 3
dJ/dx = (dJ/dt)(dt/dx), dt/dx = 2x 95
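The example above is small enough to write out directly in code. The illustrative Python version below stores each intermediate quantity on the forward pass, visits the quantities in reverse order on the backward pass, and checks the result against a finite-difference approximation (the check is my addition, not part of the slide).

import numpy as np

def forward_backward(x):
    """Forward and backward pass for J = cos(sin(x^2) + 3 x^2)."""
    # Forward pass (store every intermediate quantity)
    t = x ** 2
    u1 = np.sin(t)
    u2 = 3.0 * t
    u = u1 + u2
    J = np.cos(u)

    # Backward pass, in reverse topological order
    dJ_du = -np.sin(u)
    dJ_du1 = dJ_du * 1.0                          # du/du1 = 1
    dJ_du2 = dJ_du * 1.0                          # du/du2 = 1
    dJ_dt = dJ_du1 * np.cos(t) + dJ_du2 * 3.0     # du1/dt = cos(t), du2/dt = 3
    dJ_dx = dJ_dt * 2.0 * x                       # dt/dx = 2x
    return J, dJ_dx

x = 0.7
J, dJ_dx = forward_backward(x)
# Sanity check against a finite-difference approximation of dJ/dx
eps = 1e-6
fd = (forward_backward(x + eps)[0] - forward_backward(x - eps)[0]) / (2 * eps)
print(J, dJ_dx, fd)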

96 Training Backpropagation Case 1: Logistic Regression [Diagram: Output node over Input nodes with weights θ_1, θ_2, θ_3, ..., θ_M]
Forward: J = y* log(y) + (1 - y*) log(1 - y), y = 1 / (1 + exp(-a)), a = Σ_{j=0}^D θ_j x_j
Backward: dJ/dy = y*/y + (1 - y*)/(y - 1)
dJ/da = (dJ/dy)(dy/da), dy/da = exp(-a) / (exp(-a) + 1)^2
dJ/dθ_j = (dJ/da)(da/dθ_j), da/dθ_j = x_j
dJ/dx_j = (dJ/da)(da/dx_j), da/dx_j = θ_j 96
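A sketch of Case 1 in code (illustrative only; it mirrors the Forward and Backward columns above and returns the gradient with respect to both θ and x; the input values are made up):

import numpy as np

def logreg_forward_backward(theta, x, y_star):
    # Forward pass
    a = theta @ x                                  # a = sum_j theta_j x_j
    y = 1.0 / (1.0 + np.exp(-a))                   # y = sigmoid(a)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)

    # Backward pass
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = np.exp(-a) / (np.exp(-a) + 1) ** 2     # derivative of the sigmoid
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = dJ_da * x                          # da/dtheta_j = x_j
    dJ_dx = dJ_da * theta                          # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx

theta = np.array([0.1, -0.4, 0.7])
x = np.array([1.0, 2.0, -1.0])
print(logreg_forward_backward(theta, x, y_star=1.0))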

97 Training Backpropagation
(E) Output (sigmoid): y = 1 / (1 + exp(-b))
(D) Output (linear): b = Σ_{j=0}^D β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
[Diagram: Input → Hidden Layer → Output] 97

98 Training Backpropagation
(F) Loss: J = (1/2)(y - y*)^2
(E) Output (sigmoid): y = 1 / (1 + exp(-b))
(D) Output (linear): b = Σ_{j=0}^D β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
[Diagram: Input → Hidden Layer → Output] 98

99 Training Backpropagation Case 2: Neural Network
Forward: J = y* log(y) + (1 - y*) log(1 - y)
y = 1 / (1 + exp(-b))
b = Σ_{j=0}^D β_j z_j
z_j = 1 / (1 + exp(-a_j))
a_j = Σ_{i=0}^M α_{ji} x_i
Backward: dJ/dy = y*/y + (1 - y*)/(y - 1)
dJ/db = (dJ/dy)(dy/db), dy/db = exp(-b) / (exp(-b) + 1)^2
dJ/dβ_j = (dJ/db)(db/dβ_j), db/dβ_j = z_j
dJ/dz_j = (dJ/db)(db/dz_j), db/dz_j = β_j
dJ/da_j = (dJ/dz_j)(dz_j/da_j), dz_j/da_j = exp(-a_j) / (exp(-a_j) + 1)^2
dJ/dα_{ji} = (dJ/da_j)(da_j/dα_{ji}), da_j/dα_{ji} = x_i
dJ/dx_i = Σ_{j=0}^D (dJ/da_j)(da_j/dx_i), da_j/dx_i = α_{ji} 99
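The whole of Case 2 fits in a short function. The sketch below follows the Forward and Backward columns above; the weight names alpha and beta and the bias handling via x_0 = 1 and z_0 = 1 are my reading of the slide's summation indices, and the sigmoid derivative is written in the equivalent y(1 - y) form, so treat the details as illustrative rather than definitive.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nn_forward_backward(alpha, beta, x, y_star):
    """Forward + backward pass for the 1-hidden-layer network of Case 2.

    alpha: (D, M+1) hidden weights (x[0] = 1 is the bias feature)
    beta:  (D+1,) output weights (beta[0] multiplies z_0 = 1)
    """
    # Forward pass
    a = alpha @ x                          # a_j = sum_i alpha_ji x_i
    z = np.concatenate(([1.0], sigmoid(a)))
    b = beta @ z                           # b = sum_j beta_j z_j
    y = sigmoid(b)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)

    # Backward pass
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * y * (1 - y)            # dy/db for the sigmoid
    dJ_dbeta = dJ_db * z                   # db/dbeta_j = z_j
    dJ_dz = dJ_db * beta[1:]               # db/dz_j = beta_j (skip the bias unit)
    dJ_da = dJ_dz * z[1:] * (1 - z[1:])    # dz_j/da_j for the sigmoid
    dJ_dalpha = np.outer(dJ_da, x)         # da_j/dalpha_ji = x_i
    dJ_dx = alpha.T @ dJ_da                # da_j/dx_i = alpha_ji, summed over j
    return J, dJ_dalpha, dJ_dbeta, dJ_dx

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.2])             # M = 2 features plus the bias feature
alpha = rng.normal(size=(3, 3))            # D = 3 hidden units
beta = rng.normal(size=4)
print(nn_forward_backward(alpha, beta, x, y_star=1.0))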

100 Training Chain Rule Given: y = g(u) and u = h(x). Chain Rule: dy_i/dx_k = Σ_{j=1}^J (dy_i/du_j)(du_j/dx_k), ∀i, k Backpropagation: 1. Instantiate the computation as a directed acyclic graph, where each intermediate quantity is a node 2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the goal with respect to that node's intermediate quantity. 3. Initialize all partial derivatives to 0. 4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents This algorithm is also called automatic differentiation in the reverse mode 100

101 Training Chain Rule Given: y = g(u) and u = h(x). Chain Rule: dy_i/dx_k = Σ_{j=1}^J (dy_i/du_j)(du_j/dx_k), ∀i, k Backpropagation: 1. Instantiate the computation as a directed acyclic graph, where each node represents a Tensor. 2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivatives of the goal with respect to that node's Tensor. 3. Initialize all partial derivatives to 0. 4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents This algorithm is also called automatic differentiation in the reverse mode 101

102 Training Backpropagation Case 2: Neural Network, decomposed into Modules 1-5 (one module per stage of the computation below)
Forward: J = y* log(y) + (1 - y*) log(1 - y)
y = 1 / (1 + exp(-b))
b = Σ_{j=0}^D β_j z_j
z_j = 1 / (1 + exp(-a_j))
a_j = Σ_{i=0}^M α_{ji} x_i
Backward: dJ/dy = y*/y + (1 - y*)/(y - 1)
dJ/db = (dJ/dy)(dy/db), dy/db = exp(-b) / (exp(-b) + 1)^2
dJ/dβ_j = (dJ/db)(db/dβ_j), db/dβ_j = z_j
dJ/dz_j = (dJ/db)(db/dz_j), db/dz_j = β_j
dJ/da_j = (dJ/dz_j)(dz_j/da_j), dz_j/da_j = exp(-a_j) / (exp(-a_j) + 1)^2
dJ/dα_{ji} = (dJ/da_j)(da_j/dα_{ji}), da_j/dα_{ji} = x_i
dJ/dx_i = Σ_{j=0}^D (dJ/da_j)(da_j/dx_i), da_j/dx_i = α_{ji} 102

103 Background A Recipe for Machine Learning (Gradients) 1. Given training data: 2. Choose each of these: Decision function, Loss function 3. Define goal: 4. Train with SGD: (take small steps opposite the gradient) Callout: Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently! 103

104 Summary 1. Neural Networks: provide a way of learning features; are highly nonlinear prediction functions; (can be) a highly parallel network of logistic regression classifiers; discover useful hidden representations of the input. 2. Backpropagation: provides an efficient way to compute gradients; is a special case of reverse-mode automatic differentiation. 104
