More Tips for Training Neural Network. Hung-yi Lee


1 More Tips for Training Neural Network Hung-yi Lee

2 Outline Activation Function Cost Function Data Preprocessing Training Generalization

3 Review: Training Neural Network Neural network: f(x; θ). x: input (vector); θ: model parameters (weights, biases). Cost function C(θ): how good the model parameters θ are. C(θ) = Σ_{(x,y)∈T} C(x; θ), where C(x; θ) = d(f(x; θ), y). y: reference output when x is the input; d(f(x; θ), y): difference between y and the output of the neural network with model parameters θ. Target: find the θ that minimizes C(θ).

4 Review: Gradient Descent Target: find the θ that minimizes C(θ) = Σ_{(x,y)∈T} C(x; θ). Gradient descent: initialization θ^0, then θ^t ← θ^{t−1} − η∇C(θ^{t−1}). Stochastic gradient descent: pick one (x, y) from the training data set T and update θ^t ← θ^{t−1} − η∇C^x(θ^{t−1}), using the gradient of the cost on that single example. The gradient ∇C(θ) collects the partial derivatives ∂C/∂w_ij^l and ∂C/∂b_i^l.
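
The slide's two update rules can be written out in a few lines; below is a minimal sketch in Python/NumPy on a toy least-squares cost (the data, the cost, and the function name grad_C are illustrative, not from the lecture).

```python
import numpy as np

# Toy training set T: 100 examples with 3 features and a linear reference output.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def grad_C(theta, Xb, yb):
    """Gradient of the squared-error cost on the batch (Xb, yb)."""
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

eta = 0.1
theta = np.zeros(3)                    # initialization: theta^0

# Gradient descent: use the whole training set at every update.
for t in range(100):
    theta = theta - eta * grad_C(theta, X, y)

# Stochastic gradient descent: pick one (x, y) from T at every update.
theta_sgd = np.zeros(3)
for t in range(1000):
    i = rng.integers(len(y))
    theta_sgd = theta_sgd - eta * grad_C(theta_sgd, X[i:i+1], y[i:i+1])
```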

5 Review: Backpropagation Forward pass: z^l = W^l a^{l−1} + b^l, a^l = σ(z^l). Backward pass: the error signal is δ_i^l = ∂C/∂z_i^l; with it, ∂C/∂w_ij^l = a_j^{l−1} δ_i^l for the weight connecting layer l−1 to layer l.

6 Review: Backpropagation Backward pass: at the output layer, δ^L = σ'(z^L) ⊙ ∇_y C, where ∇_y C = (∂C/∂y_1, …, ∂C/∂y_n). For the earlier layers, δ^{l−1} = σ'(z^{l−1}) ⊙ (W^l)^T δ^l: the error signal is propagated backward through the transposed weight matrices.
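
As a sketch of how the forward and backward passes above fit together, here is a tiny two-layer sigmoid network with a squared-error cost written in NumPy; the shapes and variable names are mine, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 1))                     # input a^0
W1, b1 = rng.normal(size=(5, 4)), np.zeros((5, 1))
W2, b2 = rng.normal(size=(2, 5)), np.zeros((2, 1))
y_hat = np.array([[1.0], [0.0]])                # reference output

# Forward pass: z^l = W^l a^{l-1} + b^l, a^l = sigma(z^l)
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y = sigmoid(z2)

# Backward pass: delta^L = sigma'(z^L) * dC/dy, then propagate through W^T
dC_dy = 2 * (y - y_hat)                         # squared-error cost
delta2 = sigmoid(z2) * (1 - sigmoid(z2)) * dC_dy
delta1 = sigmoid(z1) * (1 - sigmoid(z1)) * (W2.T @ delta2)

# dC/dw_ij^l = a_j^{l-1} * delta_i^l
grad_W2 = delta2 @ a1.T
grad_W1 = delta1 @ x.T
```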

7 Outline Activation Function Cost Function Data Preprocessing Training Generalization

8 Problem of Sigmoid Sigmoid function: σ(z) = 1 / (1 + e^{−z}). Derivative of the sigmoid function: σ'(z) = σ(z)(1 − σ(z)), which is always smaller than 1 (its maximum is 1/4).

9 Vanishing Gradient Problem Backward pass: δ^{l−1} = σ'(z^{l−1}) ⊙ (W^l)^T δ^l. For the sigmoid function, σ'(z) is always smaller than 1, so the error signal gets smaller and smaller as it propagates toward the input, and the gradient ∂C/∂w_ij^l = a_j^{l−1} δ_i^l is smaller in the earlier layers.

10 Vanishing Gradient Problem From the input layer to the output layer: the first layers learn very slowly and their weights are still almost random, while the last layers learn faster and have already converged. The later weights have converged based on nearly random features!?

11 ReLU / PReLU Rectified Linear Unit (ReLU): a = z when z > 0, a = 0 when z ≤ 0. Parametric ReLU (PReLU): a = z when z > 0, a = αz when z ≤ 0.
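
A minimal sketch of these activations and of the ReLU derivative used later in the backward pass (function names are mine):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)              # a = z if z > 0, else 0

def prelu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)   # a = z if z > 0, else alpha * z

def relu_grad(z):
    return (z > 0).astype(float)           # derivative: 1 for z > 0, 0 otherwise
```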

12 ReLU Backward pass: the error signal is propagated exactly as before, δ^{l−1} = σ'(z^{l−1}) ⊙ (W^l)^T δ^l, but σ is now the ReLU.

13 ReLU Backward pass: for a neuron with z > 0 the ReLU derivative is 1, so the error signal passes through with no attenuation; for a neuron with z ≤ 0 the output is 0 and the error signal is blocked.

14 ReLU Backward pass: when a neuron's output is 0, its error signal δ = ∂C/∂z is 0, so all the weights connected to this neuron will have zero gradient.

15 Maxout ReLU or PReLU is just a special case of Maxout. Compute the linear outputs z = Wx + b, group them (e.g. into pairs), and let each Maxout unit output the maximum within its group: a = max(z_1, z_2). The next layer repeats the same with z = W'a + b'.

16 Maxout ReLU is a special case: a ReLU neuron computes z = wx + b and outputs a = max(z, 0). A Maxout unit with two elements z_1 = wx + b and z_2 = 0 (its weight and bias fixed at 0) outputs a = max(z_1, z_2), which is exactly the same function.

17 Maxout More generally, a Maxout unit with two elements z_1 = w_1 x + b_1 and z_2 = w_2 x + b_2 outputs a = max(z_1, z_2): a learnable, piecewise-linear activation function.

18 Maxout - Training by Backpropagation In the forward pass, we know which z is the max in each group.

19 Maxout - Training by Backpropagation In the forward pass we know which z is the max in each group; in the backward pass, treat the network as a thinner network whose neurons have linear activation functions, so only the max element of each group receives gradient.
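
A small sketch of this forward/backward routing for groups of two (array shapes and names are illustrative): the forward pass records which element won the max, and the backward pass sends gradient only to that element.

```python
import numpy as np

def maxout_forward(z, group_size=2):
    groups = z.reshape(-1, group_size)          # split linear outputs into groups
    winners = groups.argmax(axis=1)             # which z is the max in each group
    a = groups[np.arange(len(groups)), winners]
    return a, winners

def maxout_backward(grad_a, winners, n_units, group_size=2):
    grad_z = np.zeros(n_units)
    grad_z.reshape(-1, group_size)[np.arange(len(winners)), winners] = grad_a
    return grad_z                               # non-max elements get zero gradient

z = np.array([0.3, -1.0, 2.0, 0.5])             # 4 linear outputs -> 2 maxout units
a, winners = maxout_forward(z)                  # a = [0.3, 2.0]
grad_z = maxout_backward(np.array([1.0, 1.0]), winners, n_units=4)
```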

20 Outline Activation Function Cost Function Data Preprocessing Training Generalization

21 Problem of Sigmoid With the square error cost C = Σ_n (y_n − ŷ_n)^2 and sigmoid outputs, the gradient at the output layer is ∂C/∂z_n = 2(y_n − ŷ_n) σ'(z_n). A larger difference should give a larger gradient, but when z_n is far into the saturated region the derivative σ'(z_n) is tiny, so a large difference between y_n and ŷ_n still yields a tiny gradient. How about ReLU?

22 ReLU is probably not suitable for the output layer. The target ŷ = [1 0 0 …] has only one dimension equal to 1 and all others 0, but ReLU outputs are unbounded: a larger output means larger confidence, so it is unclear which output vector is "more similar" to the target. It is better to keep the output bounded.

23 Softmax Softmax layer as the output layer. Ordinary output layer: y_i = σ(z_i), each unit squashed independently.

24 Softmax Softmax layer as the output layer: y_i = e^{z_i} / Σ_j e^{z_j}. The outputs behave like a probability: 1 > y_i > 0 and Σ_i y_i = 1. Example: z = (3, 1, −3) gives e^z ≈ (20, 2.7, 0.05) and y ≈ (0.88, 0.12, 0).
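
The example on this slide can be reproduced with a few lines of NumPy; subtracting the max before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # shift for numerical stability
    return e / e.sum()

y = softmax(np.array([3.0, 1.0, -3.0]))
print(y.round(2))                    # approximately [0.88 0.12 0.  ], as on the slide
print(y.sum())                       # 1.0: the outputs are positive and sum to one
```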

25 Softmax Define the cost function: with y_i = e^{z_i} / Σ_j e^{z_j}, minimize C(θ) = Σ_{(x,ŷ)∈T} C(x; θ) with C(x; θ) = −log y_t. The target ŷ = [1 0 … 0] has only one dimension equal to 1 and all others 0, and t is the index of the dimension which is 1.

26 Softmax Define the cost function C(x; θ) = −log y_t: the cross entropy between the target ŷ = [1 0 … 0] and the output y. Minimizing it increases y_t. Don't we have to decrease the other dimensions?

27 Softmax For the dimension i = t: z_t appears in both the numerator and the denominator of y_t = e^{z_t} / Σ_j e^{z_j}, so ∂y_t/∂z_t = y_t(1 − y_t), and with C = −log y_t we get δ_t = ∂C/∂z_t = −(1/y_t) ∂y_t/∂z_t = y_t − 1. The absolute value of δ_t is larger when y_t is far from 1.

28 Softmax For a dimension i ≠ t: z_i appears only in the denominator of y_t, so ∂y_t/∂z_i = −y_t y_i and δ_i = ∂C/∂z_i = −(1/y_t) ∂y_t/∂z_i = y_i ≥ 0. The absolute value of δ_i is larger when y_i is larger.
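
Slides 27-28 together say that the error signal of softmax + cross entropy is simply δ = y − ŷ. A quick numerical check (a sketch; the values of z and t are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
t = 0                                   # index of the correct dimension
y_hat = np.eye(3)[t]                    # target [1, 0, 0]

analytic = softmax(z) - y_hat           # delta from the derivation above

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (-np.log(softmax(zp)[t]) + np.log(softmax(zm)[t])) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```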

29 Outline Activation Function Cost Function Data Preprocessing Training Generalization

30 Data Preprocessing For each dimension i, compute the mean m_i and the standard deviation σ_i over the R training examples, and replace every value by x_i^r ← (x_i^r − m_i) / σ_i. Afterwards the means of all dimensions are 0 and the variances are all 1.
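
A minimal sketch of this normalization (the array X stands in for the training data; at test time the same m_i and σ_i computed on the training set would be reused):

```python
import numpy as np

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(1000, 20))

mean = X.mean(axis=0)                 # m_i for each dimension i
std = X.std(axis=0)                   # sigma_i for each dimension i
X_norm = (X - mean) / std             # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_norm.mean(axis=0).round(6))   # ~0 in every dimension
print(X_norm.var(axis=0).round(6))    # ~1 in every dimension
```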

31 Outline Activation Function Cost Function Data Preprocessing Training Generalization

32 Outline - Training Activation Function Cost Function Data Preprocessing Momentum Training Generalization Learning Rates

33 Outline - Training Activation Function Cost Function Data Preprocessing Momentum Training Generalization Learning Rates

34 Problem of Gradient Descent On the cost surface (cost vs. parameter space): very slow at a plateau, where the gradient is small; stuck at a local minimum, where the gradient is zero.

35 Momentum In the physical world, a ball rolling down a slope has momentum. How about putting this phenomenon into gradient descent?

36 Original Gradient Descent Start at position θ^0. Compute the gradient at θ^0; move to θ^1 = θ^0 − η∇C(θ^0). Compute the gradient at θ^1; move to θ^2 = θ^1 − η∇C(θ^1). And so on: the movement is always opposite to the gradient.

37 Gradient Descent with Momentum Start at point θ^0 with momentum v^0 = 0 (v^i: movement at the i-th update). Compute the gradient at θ^0; momentum v^1 = λv^0 − η∇C(θ^0); move to θ^1 = θ^0 + v^1. Compute the gradient at θ^1; momentum v^2 = λv^1 − η∇C(θ^1); move to θ^2 = θ^1 + v^2.

38 Gradient Descent with Momentum v^i is actually a weighted sum of all the previous gradients ∇C(θ^0), ∇C(θ^1), …, ∇C(θ^{i−1}): v^0 = 0, v^1 = −η∇C(θ^0), v^2 = −λη∇C(θ^0) − η∇C(θ^1), and so on. The update is still v^i = λv^{i−1} − η∇C(θ^{i−1}), θ^i = θ^{i−1} + v^i.
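
The update rule from slides 37-38 in a few lines (the values of λ and η and the toy cost are my own choices for illustration):

```python
def momentum_step(theta, v, grad, lam=0.9, eta=0.01):
    v_new = lam * v - eta * grad       # v^i = lambda * v^{i-1} - eta * grad C(theta^{i-1})
    theta_new = theta + v_new          # theta^i = theta^{i-1} + v^i
    return theta_new, v_new

# Example on the 1-D cost C(theta) = theta^2, whose gradient is 2 * theta.
theta, v = 5.0, 0.0
for t in range(200):
    theta, v = momentum_step(theta, v, grad=2 * theta)
print(theta)                           # close to the minimum at 0
```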

39 Gradient Descent with Momentum On the cost surface, even at a point where the gradient = 0 (a plateau or local minimum), the accumulated momentum can keep the parameters moving.

40 NAG Nesterov's Accelerated Gradient (NAG): like momentum, but the gradient is measured after first jumping in the direction of the accumulated momentum, and then a correction is made (see slides 92-93 in the appendix). Where the gradient = 0, momentum and NAG behave differently from plain gradient descent.

41 Gradient descent, Momentum, NAG Methods: Gradient descent Momentum NAG Source:

42 Outline - Training Activation Function Cost Function Data Preprocessing Momentum Training Generalization Learning Rates

43 Learning Rates Cost vs. epoch on the error surface: a very large learning rate makes the cost blow up, a large one oscillates or gets stuck, a too-small one decreases very slowly, and a learning rate that is just right makes the cost decrease steadily.

44 Learning Rates Popular & simple idea: reduce the learning rate by some factor every few epochs. At the beginning we are far from a minimum, so we use a larger learning rate; after several epochs we are close to a minimum, so we reduce the learning rate. E.g. 1/t decay: η^t = η / √(t + 1). (The assumption is not always true.)
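
The 1/t decay used here (and combined with Adagrad on slide 50) is a one-liner; the initial η below is an assumed value.

```python
import math

eta = 0.1                                  # assumed initial learning rate

def eta_t(t):
    return eta / math.sqrt(t + 1)          # eta^t = eta / sqrt(t + 1)

print([round(eta_t(t), 4) for t in range(5)])   # 0.1, 0.0707, 0.0577, 0.05, 0.0447
```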

45 Adaptive Learning Rates Each parameter should have its own learning rate: in a direction with a larger gradient, use a smaller learning rate; in a direction with a smaller gradient, use a larger learning rate.

46 RProp g^t: the gradient obtained at w^t. Each parameter should have a different learning rate. Gradient descent: w^{t+1} ← w^t − ηg^t, so w decreases if g^t > 0 and increases if g^t < 0. Another aspect: the same update can be read as w^{t+1} ← w^t − η|g^t| · sign(g^t), i.e. the effective learning rate (step size) is η|g^t|.

47 RProp - Problem RProp is problematic for stochastic gradient descent. (Full) gradient descent updates with the expected gradient ∂C/∂w = E[∂C^x/∂w]: w^{t+1} ← w^t − η ∂C/∂w. Stochastic gradient descent updates with the gradient of a single example, w^{t+1} ← w^t − η ∂C^x/∂w; these individual gradients are noisy, so step sizes adapted from them (as in RProp) can be misleading.

48 Adagrad Divide the learning rate by the average gradient: w^{t+1} ← w^t − (η/σ^t) g^t, where σ is the average (root-mean-square) gradient of parameter w, estimated while updating the parameters. If w has a small average gradient σ, it gets a larger learning rate; if w has a large average gradient σ, it gets a smaller learning rate.

49 Adagrad w^1 ← w^0 − (η/σ^0) g^0 with σ^0 = √((g^0)^2); w^2 ← w^1 − (η/σ^1) g^1 with σ^1 = √(½[(g^0)^2 + (g^1)^2]); w^3 ← w^2 − (η/σ^2) g^2 with σ^2 = √(⅓[(g^0)^2 + (g^1)^2 + (g^2)^2]); in general w^{t+1} ← w^t − (η/σ^t) g^t with σ^t = √((1/(t+1)) Σ_{i=0}^{t} (g^i)^2).

50 Adagrad Divide the learning rate by the average gradient, which is obtained while updating the parameters: w^{t+1} ← w^t − (η^t/σ^t) g^t with σ^t = √((1/(t+1)) Σ_{i=0}^{t} (g^i)^2) and the 1/t decay η^t = η / √(t + 1). Combining the two gives w^{t+1} ← w^t − (η / √(Σ_{i=0}^{t} (g^i)^2)) g^t.
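
A sketch of the combined Adagrad update on a 1-D toy cost (η and the cost are illustrative):

```python
import numpy as np

eta = 1.0
w = 5.0
sum_g2 = 0.0
for t in range(100):
    g = 2 * w                           # gradient of the toy cost C(w) = w^2
    sum_g2 += g ** 2                    # accumulate (g^i)^2 over all updates
    w = w - eta / np.sqrt(sum_g2) * g   # w^{t+1} = w^t - eta / sqrt(sum g^2) * g^t
print(w)                                # approaches the minimum at 0
```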

51 RMSProp The error surface can be complex in deep learning: the appropriate learning rate may need to be smaller in one region of the w_1-w_2 plane and larger in another, even for the same parameter, so σ should depend more on the recent gradients than on the whole history.

52 RMSProp w^1 ← w^0 − (η/σ^0) g^0 with σ^0 = |g^0|; w^2 ← w^1 − (η/σ^1) g^1 with σ^1 = √(α(σ^0)^2 + (1 − α)(g^1)^2); w^3 ← w^2 − (η/σ^2) g^2 with σ^2 = √(α(σ^1)^2 + (1 − α)(g^2)^2); in general w^{t+1} ← w^t − (η/σ^t) g^t with σ^t = √(α(σ^{t−1})^2 + (1 − α)(g^t)^2).
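
The same 1-D toy cost as in the Adagrad sketch above, now with the RMSProp update; the values of α and η are mine.

```python
import numpy as np

eta, alpha = 0.1, 0.9
w, sigma = 5.0, None
for t in range(200):
    g = 2 * w                                       # gradient of the toy cost w^2
    if sigma is None:
        sigma = abs(g)                              # sigma^0 = |g^0|
    else:
        sigma = np.sqrt(alpha * sigma**2 + (1 - alpha) * g**2)
    w = w - eta / sigma * g                         # w^{t+1} = w^t - eta / sigma^t * g^t
print(w)                                            # ends up near the minimum at 0
```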

53 izing_gradient_optimization_techniques/cklhott (By Alec Radford)

54 izing_gradient_optimization_techniques/cklhott (By Alec Radford)

55 Not the whole story Adadelta ogletr0.pdf Adam AdaSecant No more pesky learning rates Using second order information

56 Outline Activation Function Cost Function Data Preprocessing Training Generalization

57 Outline - Generalization Activation Function Cost Function Data Preprocessing Training Weight Decay Generalization Dropout

58 Outline - Generalization Activation Function Cost Function Data Preprocessing Training Weight Decay Generalization Dropout

59 Weight Decay Our Brain

60 Weight Decay Parameters closer to zero are preferred. Training data: {(x^1, ŷ^1), (x^2, ŷ^2), …}; testing data: {(x'^1, ŷ^1), …}, where x' is x corrupted by noise. With smaller weights w, the noise on the input changes z = Σ w_i x_i (and hence the output) less, so to minimize the effect of the noise we want w close to zero.

61 Weight Decay New cost function to be minimized: find a set of weights that not only minimizes the original cost but is also close to zero. C'(θ) = C(θ) + λ · ½‖θ‖^2, where C is the original cost (e.g. square error, cross entropy) and the regularization term ½‖θ‖^2 = ½ Σ (w_i)^2 sums over the weights W^1, W^2, … (the biases are not considered. Why?)

62 Weight Decay New cost function to be minimized: C'(θ) = C(θ) + λ · ½‖θ‖^2. Gradient: ∂C'/∂w = ∂C/∂w + λw. Update: w^{t+1} ← w^t − η(∂C/∂w + λw^t) = (1 − ηλ)w^t − η ∂C/∂w. Since 1 − ηλ < 1, the weights become smaller and smaller unless the gradient of the original cost pushes back.
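
The weight-decay update on this slide, sketched with an assumed η and λ and a flat original cost so that only the decay term acts:

```python
import numpy as np

eta, lam = 0.01, 0.1

def weight_decay_step(w, grad_C):
    # w <- (1 - eta * lambda) * w - eta * dC/dw
    return (1 - eta * lam) * w - eta * grad_C

w = np.array([2.0, -3.0])
for t in range(1000):
    grad_C = np.zeros_like(w)          # pretend the original cost is flat here
    w = weight_decay_step(w, grad_C)
print(w)                               # the weights have shrunk toward zero
```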

63 Outline - Generalization Activation Function Cost Function Data Preprocessing Training Weight Decay Generalization Dropout

64 Dropout Training (stochastic gradient descent, θ^t ← θ^{t−1} − η∇C(θ^{t−1})): in each iteration, each neuron has probability p% to drop out.

65 Dropout Training (stochastic gradient descent): in each iteration, each neuron has probability p% to drop out, so the structure of the network is changed and becomes thinner. Use the new, thinner network for that training iteration; for each iteration, we resample the dropped-out neurons.

66 Dropout Testing: no dropout. If the dropout rate at training is p%, all the weights are multiplied by (1 − p)%. Assume the dropout rate is 50%: if w_ij^l = 1 from training, set w_ij^l = 0.5 for testing.

67 Dropout - Intuitive Reason "My partner will slack off, so I have to work hard." When working as a team, if everyone expects the partner to do the work, nothing gets done in the end. However, if you know your partner may drop out, you will do better. When testing, no one actually drops out, so good results are obtained eventually.

68 Dropout - Intuitive Reason Why should the weights be multiplied by (1 − p)% (p% is the dropout rate) when testing? Assume the dropout rate is 50%. During training, on average half of the terms in z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 are dropped; at testing there is no dropout, so multiplying the weights from training by (1 − p)% keeps z at roughly the same scale (z' ≈ z).
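
A sketch of the training/testing asymmetry described on slides 64-66 (a single linear layer, names and sizes are illustrative): averaging many dropped-out forward passes gives roughly the same result as one pass with the weights scaled by (1 − p).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # dropout rate

def forward_train(x, W):
    mask = rng.random(x.shape) >= p        # each input unit dropped with probability p
    return W @ (x * mask)

def forward_test(x, W):
    return (W * (1 - p)) @ x               # no dropout, all weights times (1 - p)

x = rng.normal(size=4)
W = rng.normal(size=(2, 4))
avg_train = np.mean([forward_train(x, W) for _ in range(10000)], axis=0)
print(avg_train)                           # close to ...
print(forward_test(x, W))                  # ... the test-time output
```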

69 Dropout - Ensemble Ensemble: split the training set into subsets (set 1, set 2, set 3, set 4) and train a bunch of networks with different structures (network 1 to network 4), one on each subset.

70 Dropout - Ensemble Ensemble: at testing time, feed the testing data into network 1 to network 4, obtain the outputs y_1, y_2, y_3, y_4, and average them.

71 Dropout - Ensemble Dropout is a kind of ensemble. Training of dropout: with M neurons there are 2^M possible thinned networks; each piece of training data trains one of these networks, and some parameters are shared among the networks.

72 Dropout - Ensemble Testing of dropout: the ensemble view would average the outputs y_1, y_2, y_3, … of all the thinned networks on the testing data, which is intractable. Instead, all the weights are multiplied by (1 − p)% and the full network is used once; does its output y approximate that average?

73 Dropout - Ensemble Experiments on handwritten digit classification. Ref:

74 Practical Suggestions for Dropout Larger network: if you know your task needs n neurons, then for dropout rate p your network needs n/(1 − p) neurons. Longer training time. Higher learning rate. Larger momentum.

75 Not the whole story To learn more about the theory of dropout: Baldi, P., & Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural Information Processing Systems. DropConnect: Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., & Fergus, R. (2013). Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13). Standout: Ba, J., & Frey, B. (2013). Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems.

76 Concluding Remarks Activation function: ReLU, Maxout. Cost function: Softmax + cross entropy. Data preprocessing: mean = 0, variance = 1. Training: Momentum, NAG, Adagrad, RMSProp. Generalization: weight decay, dropout. Outlook: how to connect the neurons, parameter initialization.

77 Acknowledgement Thanks to 賴威昇 for correcting errors on the slides.

78 Appendix

79 Optimization method ahiriprochnowng0.pdf For RNN kever3.pdf

80 Kludge? The smallness of the weights means that the behaviour of the network won't change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data. Simple is not always better If one had judged merely on the grounds of simplicity, then some modified form of Newton's theory would arguably have been more attractive. the dynamics of gradient descent learning in multilayer nets has a `self-regularization' effect"

81 Use genetic approach

82 Story of NN ents/5lnbt/ama_yann_lecun/chivdv7

83 Reference Everything about tuning parameters

84 Activation Functions Softplus: a = ln(1 + e^z). Softsign: a = z / (1 + |z|). Hyperbolic tangent: a = (e^z − e^{−z}) / (e^z + e^{−z}).

85 RProp Each parameter has a different learning rate: θ_i ← θ_i − η_i ∂C/∂θ_i, with η_i parameter-dependent. If ∂C/∂θ_i has the same sign at the (t−1)-th and t-th iterations, increase η_i (η_i ← 1.2 η_i); otherwise, decrease it (η_i ← 0.5 η_i).

86 RProp Original gradient descent vs. RProp: cost against the number of updates (0.0k ~ 0.3k updates).

87 Adadelta

88 MNIST TIMIT

89 Error Surface: Square Error vs. Cross Entropy Red: square error; black: cross entropy (cost plotted over w_1 and w_2). Ref: Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics.

90 Reference Mean Variance Decorrelation

91 ReLU Backward pass (full picture): the error signal δ^l at layer l is obtained from δ^{l+1} at layer l+1 through (W^{l+1})^T, exactly as in the sigmoid case, with σ'(z) replaced by the ReLU derivative (1 when z > 0, 0 otherwise), and so on back through layers l−1, l−2, …

92 Momentum vs. NAG Momentum: start at point θ^0, v^0 = 0 (the movement at the last step); compute the gradient at θ^0, v^1 = λv^0 − η∇C(θ^0), move to θ^1 = θ^0 + v^1; compute the gradient at θ^1, v^2 = λv^1 − η∇C(θ^1), move to θ^2 = θ^1 + v^2. NAG: first make a big jump in the direction of the previous accumulated gradient, then measure the gradient where you end up and make a correction ("it's better to correct a mistake after you have made it!"): v^1 = λv^0 − η∇C(θ^0 + λv^0), θ^1 = θ^0 + v^1; v^2 = λv^1 − η∇C(θ^1 + λv^1), θ^2 = θ^1 + v^2; v^3 = λv^2 − η∇C(θ^2 + λv^2), θ^3 = θ^2 + v^3.

93 NAG Left: the look-ahead form from the previous slide, v^i = λv^{i−1} − η∇C(θ^{i−1} + λv^{i−1}), θ^i = θ^{i−1} + v^i. Right: an equivalent rearrangement in which the gradient is computed at the point already reached: v^1 = 0 − η∇C(θ^0); move to θ^1 = θ^0 + v^1, then move ahead by λv^1; v^2 = λv^1 − η∇C(θ^1); move to θ^2 = θ^1 + v^2; v^3 = λv^2 − η∇C(θ^2); move to θ^3 = θ^2 + v^3.

94 Why Zero Mean? For a neuron in the first layer, ∂C/∂w_j = x_j · ∂C/∂z, where ∂C/∂z is the same for all j. If the inputs x_j are always positive, all the ∂C/∂w_j share the same sign, so in one update every weight can only move in the same direction; getting from the initial point to a target where some weights should increase and others decrease is slow and zig-zag. The sigmoid's outputs are always positive; the hyperbolic tangent's are zero-mean. Sigmoid < hyperbolic tangent?
