Advanced Machine Learning


1 Advanced Machine Learning
Lecture 4: Deep Learning Essentials
Pierre Geurts, Gilles Louppe, Louis Wehenkel

2 Outline
Goal: explain and motivate the basic constructs of neural networks.
From linear discriminant analysis to logistic regression
Stochastic gradient descent
From logistic regression to the multi-layer perceptron
Vanishing gradients and rectified networks
Universal approximation theorem

3 Deep Learning

4 Threshold Logic Unit
The Threshold Logic Unit (McCulloch and Pitts, 1943) was the first mathematical model for a neuron. Assuming Boolean inputs and outputs, it is defined as
$f(x) = 1_{\{\sum_i w_i x_i + b \geq 0\}}$.
This unit can implement:
$\mathrm{or}(a, b) = 1_{\{a + b - 0.5 \geq 0\}}$
$\mathrm{and}(a, b) = 1_{\{a + b - 1.5 \geq 0\}}$
$\mathrm{not}(a) = 1_{\{-a + 0.5 \geq 0\}}$
Therefore, any Boolean function can be built with such units.
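
To make this concrete, here is a minimal Python sketch (not from the original slides) of a threshold logic unit; the weights and biases below simply reproduce the or, and and not gates given above.

```python
def tlu(x, w, b):
    """Threshold logic unit: returns 1 if sum_i w_i x_i + b >= 0, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b >= 0)

def OR(a, b):  return tlu([a, b], [1, 1], -0.5)   # a + b - 0.5 >= 0
def AND(a, b): return tlu([a, b], [1, 1], -1.5)   # a + b - 1.5 >= 0
def NOT(a):    return tlu([a],    [-1],    0.5)   # -a + 0.5 >= 0

assert [OR(a, b) for a in (0, 1) for b in (0, 1)]  == [0, 1, 1, 1]
assert [AND(a, b) for a in (0, 1) for b in (0, 1)] == [0, 0, 0, 1]
assert [NOT(a) for a in (0, 1)]                    == [1, 0]
```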

5 Perceptron
The perceptron (Rosenblatt, 1957) is very similar, except that the inputs are real:
$f(x) = \begin{cases} 1 & \text{if } \sum_i w_i x_i + b \geq 0 \\ 0 & \text{otherwise} \end{cases}$
This model was originally motivated by biology, with the $w_i$ being synaptic weights and $x_i$ and $f$ firing rates.
This is a cartoonesque biological model.

6 Let us define the activation function
$\sigma(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$
Therefore, the perceptron classification rule can be rewritten as $f(x) = \sigma(w^T x + b)$.

7 Linear discriminant analysis
Consider training data $(x, y) \sim P_{X,Y}$, with $x \in \mathbb{R}^p$ and $y \in \{0, 1\}$.
Assume the class populations are Gaussian, with the same covariance matrix (homoscedasticity):
$P(x|y) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu_y)^T \Sigma^{-1} (x - \mu_y)\right)$

8 Using Bayes' rule, we have
$P(Y=1|x) = \frac{P(x|Y=1)P(Y=1)}{P(x)} = \frac{P(x|Y=1)P(Y=1)}{P(x|Y=0)P(Y=0) + P(x|Y=1)P(Y=1)} = \frac{1}{1 + \frac{P(x|Y=0)P(Y=0)}{P(x|Y=1)P(Y=1)}}$.

9 It follows that, with
$\sigma(x) = \frac{1}{1 + \exp(-x)}$,
we get
$P(Y=1|x) = \sigma\left(\log \frac{P(x|Y=1)}{P(x|Y=0)} + \log \frac{P(Y=1)}{P(Y=0)}\right)$.

10 Therefore, writing $a = \log \frac{P(Y=1)}{P(Y=0)}$,
$P(Y=1|x) = \sigma\left(\log P(x|Y=1) - \log P(x|Y=0) + a\right)$
$= \sigma\left(-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1}(x-\mu_1) + \frac{1}{2}(x-\mu_0)^T \Sigma^{-1}(x-\mu_0) + a\right)$
$= \sigma\big(\underbrace{(\mu_1-\mu_0)^T \Sigma^{-1}}_{w^T} x \underbrace{- \tfrac{1}{2}(\mu_1^T \Sigma^{-1}\mu_1 - \mu_0^T \Sigma^{-1}\mu_0) + a}_{b}\big)$
$= \sigma(w^T x + b)$
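
As a concrete illustration (a sketch under the slide's LDA assumptions, not code from the lecture), the parameters $w$ and $b$ above can be estimated by plugging the empirical class means, a pooled covariance and the empirical class priors into the closed-form expressions; the posterior is then $\sigma(w^T x + b)$.

```python
import numpy as np

def lda_posterior_params(X, y):
    """Estimate w, b so that P(Y=1|x) = sigmoid(w @ x + b) under the LDA assumptions."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled (shared) covariance estimate
    Sigma = ((len(X0) - 1) * np.cov(X0, rowvar=False) +
             (len(X1) - 1) * np.cov(X1, rowvar=False)) / (len(X) - 2)
    Sigma_inv = np.linalg.inv(Sigma)
    a = np.log(len(X1) / len(X0))                      # log P(Y=1) / P(Y=0), estimated from counts
    w = Sigma_inv @ (mu1 - mu0)
    b = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0) + a
    return w, b

# usage: w, b = lda_posterior_params(X, y); P(Y=1|x) is then 1 / (1 + np.exp(-(x @ w + b)))
```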


14 Note that the sigmoid function $\sigma(x) = \frac{1}{1 + \exp(-x)}$ looks like a soft Heaviside step function. Therefore, the overall model $f(x; w, b) = \sigma(w^T x + b)$ is very similar to the perceptron.

15 In terms of tensor operations, the computational graph of $f$ can be represented as a graph in which white nodes correspond to inputs and outputs, red nodes correspond to model parameters, and blue nodes correspond to intermediate operations, which themselves produce intermediate output values (not represented). This unit is the core component of all neural networks!

16 Logistic regression
Same model $P(Y=1|x) = \sigma(w^T x + b)$ as for linear discriminant analysis.
But, ignore the model assumptions (Gaussian class populations, homoscedasticity); instead, find the $w, b$ that maximize the likelihood of the data.

17 We have
$\arg\max_{w,b} P(d|w,b) = \arg\max_{w,b} \prod_{x_i, y_i \in d} P(Y = y_i | x_i, w, b)$
$= \arg\max_{w,b} \prod_{x_i, y_i \in d} \sigma(w^T x_i + b)^{y_i} (1 - \sigma(w^T x_i + b))^{1 - y_i}$
$= \arg\min_{w,b} \underbrace{\sum_{x_i, y_i \in d} -y_i \log \sigma(w^T x_i + b) - (1 - y_i) \log(1 - \sigma(w^T x_i + b))}_{\mathcal{L}(w,b) = \sum_i \ell(y_i, \hat{y}(x_i; w, b))}$
This loss is an instance of the cross-entropy $H(p, q) = \mathbb{E}_p[-\log q]$ for $p = Y|x_i$ and $q = \hat{Y}|x_i$.
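
Written out in code, $\mathcal{L}(w, b)$ is just the negative log-likelihood above; a minimal NumPy sketch (function names are ours, not the lecture's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, b, X, y):
    """L(w, b) = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ], with p_i = sigmoid(w^T x_i + b)."""
    p = sigmoid(X @ w + b)
    eps = 1e-12                                  # avoid log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```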

18 When $Y$ takes values in $\{-1, 1\}$, a similar derivation yields the logistic loss
$\mathcal{L}(w, b) = -\sum_{x_i, y_i \in d} \log \sigma\big(y_i (w^T x_i + b)\big)$.
In general, the cross-entropy and the logistic losses do not admit a minimizer that can be expressed analytically in closed form. However, a minimizer can be found numerically, using a general minimization technique such as gradient descent.

19 Gradient descent
Let $\mathcal{L}(\theta)$ denote a loss function defined over model parameters $\theta$ (e.g., $w$ and $b$).
To minimize $\mathcal{L}(\theta)$, gradient descent uses local linear information to iteratively move towards a (local) minimum.
For $\theta_0 \in \mathbb{R}^d$, a first-order approximation around $\theta_0$ can be defined as
$\hat{\mathcal{L}}(\theta_0 + \epsilon) = \mathcal{L}(\theta_0) + \epsilon^T \nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{2\gamma}\|\epsilon\|_2^2$.

20 A minimizer of the approximation $\hat{\mathcal{L}}(\theta_0 + \epsilon)$ is given for
$\nabla_\epsilon \hat{\mathcal{L}}(\theta_0 + \epsilon) = 0 = \nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{\gamma}\epsilon$,
which results in the best improvement for the step $\epsilon = -\gamma \nabla_\theta \mathcal{L}(\theta_0)$.
Therefore, model parameters can be updated iteratively using the update rule
$\theta_{t+1} = \theta_t - \gamma \nabla_\theta \mathcal{L}(\theta_t)$
Notes:
$\theta_0$ are the initial parameters of the model;
$\gamma$ is the learning rate;
both are critical for the convergence of the update rule.
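
A minimal sketch of this update rule, applied to the logistic regression loss defined earlier (the gradient $\nabla_w \mathcal{L} = \sum_i (\sigma(w^T x_i + b) - y_i)\, x_i$ follows from the chain rule; the learning rate and step count below are illustrative):

```python
import numpy as np

def gradient_descent_logistic(X, y, gamma=0.1, n_steps=1000):
    """Batch gradient descent theta_{t+1} = theta_t - gamma * grad L(theta_t) for logistic regression."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities sigma(w^T x_i + b)
        grad_w = X.T @ (p - y)                   # d L / d w
        grad_b = np.sum(p - y)                   # d L / d b
        w -= gamma * grad_w
        b -= gamma * grad_b
    return w, b
```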

21 Example 1: Convergence to a local minimum


29 Example 2: Convergence to the global minimum


37 Example 3: Divergence due to too large a learning rate


43 Stochastic gradient descent
In the empirical risk minimization setup,
$\mathcal{L}(\theta) = \frac{1}{N} \sum_{x_i, y_i \in d} \ell(y_i, f(x_i; \theta))$
and its gradient decomposes as
$\nabla \mathcal{L}(\theta) = \frac{1}{N} \sum_{x_i, y_i \in d} \nabla \ell(y_i, f(x_i; \theta))$.
Therefore, in batch gradient descent the complexity of an update grows linearly with the size $N$ of the dataset.
More importantly, since the empirical risk is already an approximation of the expected risk, it should not be necessary to carry out the minimization with great accuracy.

44 Instead, stochastic gradient descent uses the update rule
$\theta_{t+1} = \theta_t - \gamma \nabla \ell(y_{i(t+1)}, f(x_{i(t+1)}; \theta_t))$.
Iteration complexity is independent of $N$.
The stochastic process $\{\theta_t \mid t = 1, \ldots\}$ depends on the examples $i(t)$ picked randomly at each iteration.

45 [Figure: batch gradient descent vs. stochastic gradient descent]

46 Why is stochastic gradient descent still a good idea?
Informally, averaging the update
$\theta_{t+1} = \theta_t - \gamma \nabla \ell(y_{i(t+1)}, f(x_{i(t+1)}; \theta_t))$
over all choices $i(t+1)$ restores batch gradient descent.
Formally, if the gradient estimate is unbiased, e.g., if
$\mathbb{E}_{i(t+1)}\left[\nabla \ell(y_{i(t+1)}, f(x_{i(t+1)}; \theta_t))\right] = \frac{1}{N} \sum_{x_i, y_i \in d} \nabla \ell(y_i, f(x_i; \theta_t)) = \nabla \mathcal{L}(\theta_t)$,
then the formal convergence of SGD can be proved, under appropriate assumptions (see references).
When decomposing the excess error in terms of approximation, estimation and optimization errors, stochastic algorithms yield the best generalization performance (in terms of expected risk) despite being the worst optimization algorithms (in terms of empirical risk) (Bottou, 2011).
Interestingly, if training examples $x_i, y_i \sim P_{X,Y}$ are received and used in an online fashion, then SGD directly minimizes the expected risk.
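
For contrast with the batch rule, a sketch of SGD on the same logistic model (the learning rate, epoch count and random seed are illustrative choices):

```python
import numpy as np

def sgd_logistic(X, y, gamma=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent for logistic regression: one randomly picked example per update."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):            # i(t): random example index
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))
            w -= gamma * (p - y[i]) * X[i]           # cost of one update is independent of N
            b -= gamma * (p - y[i])
    return w, b
```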

47 Layers
So far we considered the logistic unit $h = \sigma(w^T x + b)$, where $h \in \mathbb{R}$, $x \in \mathbb{R}^p$, $w \in \mathbb{R}^p$ and $b \in \mathbb{R}$.
These units can be composed in parallel to form a layer with $q$ outputs:
$h = \sigma(W^T x + b)$
where $h \in \mathbb{R}^q$, $x \in \mathbb{R}^p$, $W \in \mathbb{R}^{p \times q}$, $b \in \mathbb{R}^q$, and where $\sigma(\cdot)$ is upgraded to the element-wise sigmoid function.

48 Multi-layer perceptron
Similarly, layers can be composed in series, such that
$h_0 = x$
$h_1 = \sigma(W_1^T h_0 + b_1)$
$\ldots$
$h_L = \sigma(W_L^T h_{L-1} + b_L)$
$f(x; \theta) = h_L$
where $\theta = \{W_k, b_k \mid k = 1, \ldots, L\}$ denotes the model parameters.
This model is the multi-layer perceptron, also known as the fully connected feedforward network.
Optionally, the last activation $\sigma$ can be skipped to produce unbounded output values $\hat{y} \in \mathbb{R}$.
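
A sketch of the forward pass, following the slide's convention that layer $k$ computes $\sigma(W_k^T h_{k-1} + b_k)$ and that the last activation may be skipped:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases, skip_last_activation=False):
    """Compute h_L for h_k = sigmoid(W_k^T h_{k-1} + b_k), with h_0 = x.
    weights[k] has shape (p_{k-1}, p_k); biases[k] has shape (p_k,)."""
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = W.T @ h + b
        last = (k == len(weights) - 1)
        h = z if (last and skip_last_activation) else sigmoid(z)
    return h
```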


50 To minimize $\mathcal{L}(\theta)$ with stochastic gradient descent, we need the gradient $\nabla \ell(\theta_t)$.
Therefore, we require the evaluation of the (total) derivatives
$\frac{d\ell}{dW_k}$, $\frac{d\ell}{db_k}$
of the loss $\ell$ with respect to all model parameters $W_k$, $b_k$, for $k = 1, \ldots, L$.
These derivatives can be evaluated automatically from the computational graph of $\ell$ using automatic differentiation.

51 Automatic differentiation
Consider a 1-dimensional output composition $f \circ g$, such that
$y = f(u)$
$u = g(x) = (g_1(x), \ldots, g_m(x))$.
The chain rule of total derivatives states that
$\frac{dy}{dx} = \sum_{k=1}^{m} \frac{\partial y}{\partial u_k} \underbrace{\frac{du_k}{dx}}_{\text{recursive case}}$
Since a neural network is a composition of differentiable functions, the total derivatives of the loss can be evaluated by applying the chain rule recursively over its computational graph. The implementation of this procedure is called (reverse) automatic differentiation (AD).

52 As a guiding example, let us consider a simplified 2-layer MLP and the following loss function:
$f(x; W_1, W_2) = \sigma\big(W_2^T \sigma(W_1^T x)\big)$
$\ell(y, \hat{y}; W_1, W_2) = \mathrm{cross\_ent}(y, \hat{y}) + \lambda\big(\|W_1\|_2 + \|W_2\|_2\big)$
for $x \in \mathbb{R}^p$, $y \in \mathbb{R}$, $W_1 \in \mathbb{R}^{p \times q}$ and $W_2 \in \mathbb{R}^q$.


54 The total derivative $\frac{d\ell}{dW_1}$ can be computed backward, by walking through all paths from $\ell$ to $W_1$ in the computational graph and accumulating the terms:
$\frac{d\ell}{dW_1} = \frac{\partial \ell}{\partial u_8}\frac{du_8}{dW_1} + \frac{\partial \ell}{\partial u_4}\frac{du_4}{dW_1}$
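
For the simplified 2-layer MLP above, this backward accumulation can be written out by hand. The following sketch (regularization term omitted, variable names ours, not the lecture's) returns the gradients of the cross-entropy with respect to $W_1$ and $W_2$, using the standard simplification that the derivative of the loss with respect to the pre-activation of a sigmoid output is $\hat{y} - y$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_2layer(x, y, W1, W2):
    """Gradients of cross_ent(y, y_hat) for y_hat = sigmoid(W2^T sigmoid(W1^T x)).
    Shapes: x (p,), y in {0, 1}, W1 (p, q), W2 (q,). Regularization omitted."""
    # forward pass, keeping the intermediate values needed for the backward pass
    z1 = W1.T @ x                       # (q,)
    h = sigmoid(z1)                     # (q,)
    z2 = W2 @ h                         # scalar
    y_hat = sigmoid(z2)
    # backward pass: accumulate the chain rule from the output towards W1
    dz2 = y_hat - y                     # d loss / d z2 for a sigmoid output with cross-entropy
    dW2 = dz2 * h                       # (q,)
    dh = dz2 * W2                       # (q,)
    dz1 = dh * h * (1.0 - h)            # (q,)   element-wise sigmoid derivative
    dW1 = np.outer(x, dz1)              # (p, q)
    return dW1, dW2
```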

55 This algorithm is known as reverse-mode automatic differentiation, also called backpropagation.
An equivalent procedure can be defined to evaluate the derivatives in forward mode, from inputs to outputs.
Automatic differentiation generalizes to $N$ inputs and $M$ outputs:
if $N \gg M$, reverse-mode automatic differentiation is computationally more efficient;
otherwise, if $M \gg N$, forward automatic differentiation is better.
Since differentiation is a linear operator, AD can be implemented efficiently in terms of matrix operations.

56 Vanishing gradients
Training deep MLPs with many layers has long (pre-2011) been very difficult due to the vanishing gradient problem.
Small gradients slow down, and eventually block, stochastic gradient descent. This results in a limited capacity of learning.
Backpropagated gradients, normalized histograms (Glorot and Bengio, 2010): gradients for layers far from the output vanish to zero.

57 Consider a simplified 3-layer MLP with $x, w_1, w_2, w_3 \in \mathbb{R}$, such that
$f(x; w_1, w_2, w_3) = \sigma\big(w_3\,\sigma\big(w_2\,\sigma(w_1 x)\big)\big)$.
Under the hood, this would be evaluated as
$u_1 = w_1 x$
$u_2 = \sigma(u_1)$
$u_3 = w_2 u_2$
$u_4 = \sigma(u_3)$
$u_5 = w_3 u_4$
$\hat{y} = \sigma(u_5)$
and its derivative $\frac{d\hat{y}}{dw_1}$ as
$\frac{d\hat{y}}{dw_1} = \frac{\partial \hat{y}}{\partial u_5}\frac{\partial u_5}{\partial u_4}\frac{\partial u_4}{\partial u_3}\frac{\partial u_3}{\partial u_2}\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial w_1} = \frac{\partial \sigma(u_5)}{\partial u_5}\, w_3\, \frac{\partial \sigma(u_3)}{\partial u_3}\, w_2\, \frac{\partial \sigma(u_1)}{\partial u_1}\, x$

58 The derivative of the sigmoid activation function $\sigma$ is
$\frac{d\sigma}{dx}(x) = \sigma(x)(1 - \sigma(x))$
Notice that $0 \leq \frac{d\sigma}{dx}(x) \leq \frac{1}{4}$ for all $x$.

59 Assume that the weights $w_1, w_2, w_3$ are initialized randomly from a Gaussian with zero mean and small variance, such that $|w_i| \leq 1$ with high probability. Then
$\frac{d\hat{y}}{dw_1} = \frac{\partial \sigma(u_5)}{\partial u_5}\, w_3\, \frac{\partial \sigma(u_3)}{\partial u_3}\, w_2\, \frac{\partial \sigma(u_1)}{\partial u_1}\, x$,
where each derivative factor is at most $\frac{1}{4}$ and each weight factor is at most $1$ in absolute value.
This implies that the gradient $\frac{d\hat{y}}{dw_1}$ exponentially shrinks to zero as the number of layers in the network increases. Hence the vanishing gradient problem.
In general, bounded activation functions (sigmoid, tanh, etc.) are prone to the vanishing gradient problem.
Note the importance of a proper initialization scheme.
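
A quick numerical sketch of this shrinkage (the depths and initialization scale below are illustrative): the gradient with respect to the first weight is a product of per-layer factors, each of which contains a sigmoid derivative bounded by 1/4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_first_weight(x, weights):
    """d y_hat / d w_1 for the scalar chain y_hat = sigma(w_L sigma(... sigma(w_1 x)))."""
    # forward pass: u_k = w_k * h_{k-1}, h_k = sigma(u_k), with h_0 = x
    h, us = x, []
    for w in weights:
        us.append(w * h)
        h = sigmoid(us[-1])
    # chain rule: d y_hat / d w_1 = [prod_{k >= 2} sigma'(u_k) w_k] * sigma'(u_1) * x
    grad = sigmoid(us[0]) * (1.0 - sigmoid(us[0])) * x
    for u, w in zip(us[1:], weights[1:]):
        grad *= sigmoid(u) * (1.0 - sigmoid(u)) * w   # each sigma' factor is at most 1/4
    return grad

rng = np.random.default_rng(0)
for depth in (2, 5, 10, 20):
    ws = rng.normal(0.0, 0.5, size=depth)             # small random weights
    print(depth, abs(grad_first_weight(1.0, ws)))     # magnitude shrinks roughly geometrically
```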

60 Rectified linear units
Instead of the sigmoid activation function, modern neural networks are for the most part based on rectified linear units (ReLU) (Glorot et al., 2011):
$\mathrm{ReLU}(x) = \max(0, x)$

61 Note that the derivative of the ReLU function is
$\frac{d}{dx}\mathrm{ReLU}(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{otherwise} \end{cases}$
For $x = 0$, the derivative is undefined. In practice, it is set to zero.

62 Therefore,
$\frac{d\hat{y}}{dw_1} = \underbrace{\frac{\partial \sigma(u_5)}{\partial u_5}}_{=1}\, w_3\, \underbrace{\frac{\partial \sigma(u_3)}{\partial u_3}}_{=1}\, w_2\, \underbrace{\frac{\partial \sigma(u_1)}{\partial u_1}}_{=1}\, x$
This solves the vanishing gradient problem, even for deep networks (provided proper initialization)!
Note that:
The ReLU unit dies when its input is negative, which might block gradient descent. This is actually a useful property to induce sparsity.
This issue can also be solved using leaky ReLUs, defined as $\mathrm{LeakyReLU}(x) = \max(\alpha x, x)$ for a small $\alpha \in \mathbb{R}^+$.
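
For completeness, a small sketch of both activations and the derivative convention discussed above (the default $\alpha$ below is an illustrative choice, not a value from the slides):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)         # set to 0 at x = 0, matching the convention above

def leaky_relu(x, alpha=0.01):           # alpha is an illustrative default
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)
```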

63 Universal approximation
Theorem. (Cybenko, 1989; Hornik et al., 1991) Let $\sigma(\cdot)$ be a bounded, non-constant continuous function. Let $I_p$ denote the $p$-dimensional hypercube, and $C(I_p)$ denote the space of continuous functions on $I_p$. Given any $f \in C(I_p)$ and $\epsilon > 0$, there exist $q > 0$ and $v_i, w_i, b_i$, $i = 1, \ldots, q$, such that
$F(x) = \sum_{i \leq q} v_i\, \sigma(w_i^T x + b_i)$
satisfies
$\sup_{x \in I_p} |f(x) - F(x)| < \epsilon$.
It guarantees that even a single hidden-layer network can represent any classification problem in which the boundary is locally linear (smooth);
It does not inform about good/bad architectures, nor how they relate to the optimization procedure.
The universal approximation theorem generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno, 1993).

64 Theorem (Barron, 1992). The mean integrated square error between the estimated network $\hat{F}$ and the target function $f$ is bounded by
$O\left(\frac{C_f^2}{q} + \frac{qp}{N}\log N\right)$
where $N$ is the number of training points, $q$ is the number of neurons, $p$ is the input dimension, and $C_f$ measures the global smoothness of $f$.
Combines approximation and estimation errors.
Provided enough data, it guarantees that adding more neurons will result in a better approximation.

65 Consider the 1-layer MLP
$f(x) = \sum_i w_i\, \mathrm{ReLU}(x + b_i)$.
This model can approximate any smooth 1D function, provided enough hidden units.
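
To see this concretely, here is a sketch (the target function, interval and knot placement are illustrative) that fixes the biases $b_i$ as knots and fits only the outer weights $w_i$ by least squares, since the model is linear in the $w_i$:

```python
import numpy as np

# target: a smooth 1D function on [0, 1]
xs = np.linspace(0.0, 1.0, 200)
target = np.sin(2.0 * np.pi * xs)

q = 20                                              # number of hidden units (knots)
bs = -np.linspace(0.0, 1.0, q)                      # biases b_i, so f(x) = sum_i w_i ReLU(x + b_i)
Phi = np.maximum(0.0, xs[:, None] + bs[None, :])    # (200, q) matrix of ReLU features

w, *_ = np.linalg.lstsq(Phi, target, rcond=None)    # least-squares fit of the outer weights w_i
approx = Phi @ w
print("max abs error:", np.max(np.abs(approx - target)))
```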


78 (Bayesian) Infinite networks
Consider the 1-layer MLP with a hidden layer of size $q$ and a bounded activation function $\sigma$:
$f(x) = b + \sum_{j=1}^{q} v_j h_j(x)$
$h_j(x) = \sigma\left(a_j + \sum_{i=1}^{p} u_{i,j} x_i\right)$
Assume Gaussian priors $v_j \sim \mathcal{N}(0, \sigma_v^2)$, $b \sim \mathcal{N}(0, \sigma_b^2)$, $u_{i,j} \sim \mathcal{N}(0, \sigma_u^2)$, and $a_j \sim \mathcal{N}(0, \sigma_a^2)$.

79 For a fixed value $x^{(1)}$, let us consider the prior distribution of $f(x^{(1)})$ implied by the prior distributions for the weights and biases.
We have
$\mathbb{E}[v_j h_j(x^{(1)})] = \mathbb{E}[v_j]\,\mathbb{E}[h_j(x^{(1)})] = 0$,
since $v_j$ and $h_j(x^{(1)})$ are statistically independent and $v_j$ has zero mean by hypothesis.
The variance of the contribution of each hidden unit $v_j h_j$ is
$\mathbb{V}[v_j h_j(x^{(1)})] = \mathbb{E}[(v_j h_j(x^{(1)}))^2] - \mathbb{E}[v_j h_j(x^{(1)})]^2 = \mathbb{E}[v_j^2]\,\mathbb{E}[h_j(x^{(1)})^2] = \sigma_v^2\,\mathbb{E}[h_j(x^{(1)})^2]$,
which must be finite since $h_j$ is bounded by its activation function.
We define $V(x^{(1)}) = \mathbb{E}[h_j(x^{(1)})^2]$, which is the same for all $j$.

80 By the Central Limit Theorem, as $q \to \infty$, the total contribution of the hidden units, $\sum_{j=1}^{q} v_j h_j(x^{(1)})$, to the value of $f(x^{(1)})$ becomes a Gaussian with variance $q \sigma_v^2 V(x^{(1)})$.
The bias $b$ is also Gaussian, of variance $\sigma_b^2$, so for large $q$, the prior distribution of $f(x^{(1)})$ is a Gaussian of variance $\sigma_b^2 + q \sigma_v^2 V(x^{(1)})$.
Accordingly, for $\sigma_v = \omega_v q^{-1/2}$, for some fixed $\omega_v$, the prior $f(x^{(1)})$ converges to a Gaussian of mean zero and variance $\sigma_b^2 + \omega_v^2 V(x^{(1)})$ as $q \to \infty$.
For two or more fixed values $x^{(1)}, x^{(2)}, \ldots$, a similar argument shows that, as $q \to \infty$, the joint distribution of the outputs converges to a multivariate Gaussian with means of zero and covariances of
$\mathbb{E}[f(x^{(1)}) f(x^{(2)})] = \sigma_b^2 + \sum_{j=1}^{q} \sigma_v^2\,\mathbb{E}[h_j(x^{(1)}) h_j(x^{(2)})] = \sigma_b^2 + \omega_v^2 C(x^{(1)}, x^{(2)})$
where $C(x^{(1)}, x^{(2)}) = \mathbb{E}[h_j(x^{(1)}) h_j(x^{(2)})]$ and is the same for all $j$.

81 This result states that for any set of fixed points $x^{(1)}, x^{(2)}, \ldots$, the joint distribution of $f(x^{(1)}), f(x^{(2)}), \ldots$ is a multivariate Gaussian.
In other words, the infinitely wide 1-layer MLP converges towards a Gaussian process. (Neal, 1995)
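
A quick Monte Carlo sketch of this convergence (the prior scales and the tanh activation are illustrative choices): sample many 1-layer MLPs from the prior with $\sigma_v = \omega_v q^{-1/2}$ and inspect the empirical distribution of $f(x^{(1)})$ as $q$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2])                     # a fixed input x^(1), with p = 2
sigma_a = sigma_u = sigma_b = omega_v = 1.0   # illustrative prior scales

for q in (1, 10, 100, 1000):
    sigma_v = omega_v / np.sqrt(q)            # scale the output weights as q^(-1/2)
    samples = []
    for _ in range(5000):                     # draw networks from the prior
        a = rng.normal(0.0, sigma_a, q)
        u = rng.normal(0.0, sigma_u, (len(x), q))
        v = rng.normal(0.0, sigma_v, q)
        b = rng.normal(0.0, sigma_b)
        h = np.tanh(a + x @ u)                # bounded activation
        samples.append(b + v @ h)
    s = np.array(samples)
    z = (s - s.mean()) / s.std()
    # the variance stays finite while the excess kurtosis goes to 0: the prior becomes Gaussian
    print(q, s.std(), np.mean(z**4) - 3)
```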

82 Effect of depth
Theorem (Montúfar et al., 2014). A rectifier neural network with $p$ input units and $L$ hidden layers of width $q \geq p$ can compute functions that have
$\Omega\left(\left(\frac{q}{p}\right)^{(L-1)p} q^p\right)$
linear regions.
That is, the number of linear regions of deep models grows exponentially in $L$ and polynomially in $q$.
Even for small values of $L$ and $q$, deep rectifier models are able to produce substantially more linear regions than shallow rectifier models.

83 Cooking recipe
Get data (loads of it). Get good hardware.
Define the neural network architecture as a composition of differentiable functions. Stick to non-saturating activation functions to avoid vanishing gradients. Prefer deep over shallow architectures.
Optimize with (variants of) stochastic gradient descent. Evaluate gradients with automatic differentiation.

84 References
Materials from the first part of the lecture are inspired from the excellent Deep Learning Course by Francois Fleuret (EPFL, 2018):
Lecture 3a: Linear classifiers, perceptron
Lecture 3b: Multi-layer perceptron
Further references:
Introduction to ML and Stochastic optimization
Why are deep neural networks hard to train?
Automatic differentiation in machine learning: a survey

85 Quiz
Explain the model assumptions behind the sigmoid activation function.
Why is stochastic gradient descent a good algorithm for learning (despite being a poor optimization procedure in general)?
What is automatic differentiation?
Why is training deep networks with saturating activation functions difficult?
Study the importance of weight initialization.
What is the universal approximation theorem?
