More Tips for Training Neural Network. Hung-yi Lee
|
|
- Ruby Cox
- 5 years ago
- Views:
Transcription
1 More Tips for Training Neural Network Hung-yi ee
2 Outline Activation Function Cost Function Data Preprocessing Training Generalization
3 Review: Training Neural Network Neural network: f ; θ : input (vector) θ: model parameters (weights, biases) Cost Function: C θ How good your model parameters θ is C θ =, y T C θ C θ = d f ; θ, y y: reference output when is input d f ; θ, y : difference between y and the output of neural network with model parameters θ Target: find θ that minimizes C θ
4 Review: Gradient descent Target: find θ that minimizing C θ C θ =, y T C θ Gradient descent Initialization: θ 0 θ t θ t η C θ t Stochastic gradient descent Pick, y from training data set T θ t θ t η C θ t C θ = C θ l w ij C θ b i l
5 Review: Backpropagation l i l ij l i l ij z w z w C C Forward Pass Backward Pass l l z a l T l l l W z l l l l b a W z y z δ C l l a j l j j i l w ij ayer l ayer l b W z z a T W z l i Error signal for backprop l z i
6 Review: Backpropagation ayer - - z z m z m W T ayer n z z z n C C y C y C y C w y n l ij z w l i l ij δ l C z l i Backward Pass l i z C y T z W l l T l z W Error signal for backprop
7 Outline Activation Function Cost Function Data Preprocessing Training Generalization
8 Problem of Sigmoid Sigmoid Function Derivative of Sigmoid Function Derivative of Sigmoid Function is always smaller than
9 Vanishing Gradient Problem Backward Pass: ayer - ayer - z z m z m W T n z z z n C C y C y C y y n For sigmoid function, σ z always smaller than Error signal is getting smaller and smaller Gradient is smaller C w l ij z w l i l ij C z l i l i
10 Vanishing Gradient Problem Input ayer ayer ayer Output y vector 3 y vector y earn very slowly Still random earn faster Already converge The weights are converged based on random!?
11 ReU PReU a a = z Rectified inear Unit (ReU) a = αz z σ z a a = z σ z a a = 0 z 0 z
12 ReU Backward Pass: ayer - - z z m z m W T ayer n z z z n C C y C y C y y n
13 ReU Backward Pass: ayer - No Attenuation - 0 m 0 W T ayer 0 n 0 C C y C y C y y n
14 ReU Backward Pass: n C y y C n y C ayer y C n n z C n nj n nj z w z w C C All the weights connected to this neuron will have zero gradient.
15 Maout ReU or PReu is just a special case of Maout Input + z + z Ma ma z, z a + z + z Ma a z = W + z 3 + z 4 Ma a + z 3 + z 4 Ma a a z z z = W a a
16 Maout ReU is special case Input w b z ReU a Input 0 w 0 b + z Ma a + z ma z, z a z = w + b a z = w + b z =0
17 Maout ReU is special case Input w b z ReU a a Input w w b z = w + b b + z Ma a + z ma z, z earnable Activation Function a z = w + b z = w + b
18 Maout - Training by Backpropagation In forward pass, we know which z would be the ma Input + z + z Ma a ma z, z + z + z Ma a + z 3 + z 3 + z 4 Ma a + z 4 Ma a a a
19 Maout - Training by Backpropagation In forward pass, we know which z would be the ma Input + z + z a + z + z a + z 3 a + z 3 + z 4 a + z 4 a In backward pass, treat it as network whose neurons having linear activation functions a
20 Outline Activation Function Cost Function Data Preprocessing Training Generalization
21 Problem of Sigmoid - z z z z C C y C y y y ŷ y ŷ C y y yˆ y n y n n ˆ arger difference, arger gradient m z m W T n z n C y n y ˆ n y n 0 arge difference z tiny derivative How about ReU? z
22 ReU is probably not suitable for output layer y = 0 0 Only one dimension is, and others are all 0 ReU More similar? y = ReU y = arger output means larger confidence Better? It is better to let the output bounded.
23 Softma Softma layer as the output layer Ordinary Output layer z y z z y z z 3 y3 z3
24 Softma Softma layer as the output layer Softma ayer Probability: > y i > 0 i y i = z z z 3 3 e z e.7-3 e z e 0 e z e 3 j 0.05 z j e 0.88 y 0. y 0 y 3 e e e z z z 3 3 j 3 j 3 j e e e z z z j j j
25 Softma Define cost function y i j e z i e z j Minimize C θ = C θ = logy t, y T C θ z z z i Softma y y y i ŷ ŷ ŷ i y = 0 0 Only one dimension is, and others are all 0 t Inde of the dimension which is
26 Softma Define cost function z z y z i i j Softma e z i e z j y y y i Minimize C θ = C θ = logy t y ŷ y y = ŷ y t ŷ i, y T y = C θ Cross Entropy 0 0 Increase y t Don t we have to decrease other dimensions?
27 y i i = t j e z i e C θ = logy t y tˆ tˆ z C z j tˆ j C y e z tˆ e tˆ z j y z tˆ tˆ yˆ t i y z y i tˆ tˆ C z i z z z i yˆ t y z t appears in both numerator and denominator tˆ Softma y tˆ y tˆ y y y i yˆ t ŷ ŷ ŷ i The absolute value of δ t is larger when y t is far from
28 y i i t y i tˆ j e z i e C θ = logy t z j C y C z j e i z tˆ tˆ e y z z j tˆ i yˆ t i y i yt ˆ zi C z i 0 z z z i yˆ t z i appears only in denominator Softma yˆ y t i y y y i yi 0 ŷ ŷ ŷ i The absolute value of δ i is larger when y i is larger
29 Outline Activation Function Cost Function Data Preprocessing Training Generalization
30 Data Preprocessing For each dimension i: mean: m i standard deviation: σ i 3 r R r i i r m i The means of all dimensions are 0, σ i and the variances are all
31 Outline Activation Function Cost Function Data Preprocessing Training Generalization
32 Outline - Training Activation Function Cost Function Data Preprocessing Momentum Training Generalization earning Rates
33 Outline - Training Activation Function Cost Function Data Preprocessing Momentum Training Generalization earning Rates
34 Problem of Gradient descent cost Very slow at the plateau Gradient is small Stuck at local minima Gradient is Zero parameter space
35 In physical world Momentum How about put this phenomenon in gradient descent?
36 Original Gradient descent C θ 0 θ 0 θ Gradient Movement C θ θ θ 3 C θ C θ 3 Start at position θ 0 Compute gradient at θ 0 Move to θ = θ 0 - μ C θ 0 Compute gradient at θ Move to θ = θ μ C θ
37 Gradient descent with Momentum C θ 0 C θ θ 0 θ Gradient Movement Momentum θ θ 3 Start at point θ 0 Momentum v 0 =0 (v i : movement at the i-th update) Compute gradient at θ C θ 0 Momentum v = λv 0 - μ C θ 0 Move to θ = θ 0 + v Compute gradient at θ C θ 3 Momentum v = λv - μ C θ Move to θ = θ + v
38 Gradient descent with Momentum v i is actually the weighted sum of all the previous gradient: C θ 0, C θ, C θ i v 0 =0 v = - μ C θ 0 v = - λ μ C θ 0 - μ C θ Start at point θ 0 Momentum v 0 =0 (v i : movement at the i-th update) Compute gradient at θ 0 Momentum v = λv 0 - μ C θ 0 Move to θ = θ 0 + v Compute gradient at θ Momentum v = λv - μ C θ Move to θ = θ + v
39 Gradient descent with cost Momentum Problem Gradient Movement Momentum Gradient = 0 parameter space
40 NAG Momentum Gradient Movement Momentum Nesterov s Accelerated Gradient (NAG) Gradient = 0 Gradient = 0
41 Gradient descent, Momentum, NAG Methods: Gradient descent Momentum NAG Source:
42 Outline - Training Activation Function Cost Function Data Preprocessing Momentum Training Generalization earning Rates
43 earning Rates cost Very arge arge Just make small epoch Error Surface
44 earning Rates Popular & Simple Idea: Reduce the learning rate by some factor every few epochs. At the beginning, we are far from a minimum, so we use larger learning rate After several epochs, we are close to a minimum, so we reduce the learning rate E.g. /t decay: η t = η (t + ) Not always true
45 Adaptive earning Rates Each parameter should have different learning rates Smaller earning Rate arger Gradient w w arger earning Rate Smaller earning Rate
46 RProp g t : gradient obtained with w t Each parameter should have different learning rates Gradient descent: w t+ w t ηg t If g t > 0 Another Aspect: - If g t < 0 The learning rate is η g t. wt+ w t η g t gt
47 RProp - Problem C w = E[ C w ] RProp is problematic for stochastic gradient descent Gradient descent: w t+ w t η C 0. w Stochastic Gradient descent: w t+ w t η C w
48 Adagrad Divide the learning rate by average gradient w t+ w t η σ gt σ: Average gradient of parameter w Estimated while updating the parameters If w has small average gradient σ If w has large average gradient σ arger learning rate Smaller learning rate
49 Adagrad w w 0 η σ 0 g0 σ 0 = g 0 w w η σ g σ = g0 + g w 3 w η σ g σ = 3 g0 + g + g w t+ w t η gt σt σ t = t t + i=0 g i
50 Adagrad Divide the learning rate by average gradient The average gradient is obtained while updating the parameters η t w t+ w t η σ t gt σ t = t t + i=0 g i η t = η t + /t decay w t+ w t η t i=0 g i g t
51 RMSProp Error Surface can be comple in deep learning. Smaller earning Rate w arger earning Rate w
52 RMSProp w w 0 η σ 0 g0 w w η σ g σ 0 = g 0 σ = α σ 0 + α g w 3 w η σ g σ = α σ + α g w t+ w t η gt σt σ t = α σ t + α g t
53 izing_gradient_optimization_techniques/cklhott (By Alec Radford)
54 izing_gradient_optimization_techniques/cklhott (By Alec Radford)
55 Not the whole story Adadelta ogletr0.pdf Adam AdaSecant No more pesky learning rates Using second order information
56 Outline Activation Function Cost Function Data Preprocessing Training Generalization
57 Outline - Generalization Activation Function Cost Function Data Preprocessing Training Weight Decay Generalization Dropout
58 Outline - Generalization Activation Function Cost Function Data Preprocessing Training Weight Decay Generalization Dropout
59 Weight Decay Our Brain
60 Weight Decay The parameters closer to zero is preferred. Training data:, yˆ, } { i w w w i z z w yˆ z Testing data:, yˆ, } { z' To minimize the effect of noise, we want w close to zero. w w w z w
61 Weight Decay New cost function to be minimized Find a set of weight not only minimizing original cost but also close to zero C C Original cost (e.g. minimize square error, cross entropy ) Regularization term: W, W, w w w w (not consider biases. why?)
62 Weight Decay New cost function to be minimized Gradient: w w w C C Update: w w w t t C t t w w w C w t w C Smaller and smaller w w w w C C
63 Outline - Generalization Activation Function Cost Function Data Preprocessing Training Weight Decay Generalization Dropout
64 Dropout θ t θ t η C θ t Training: (stochastic gradient descent) In each iteration Each neuron has p% to dropout
65 Dropout θ t θ t η C θ t Training: (stochastic gradient descent) Thinner! In each iteration Each neuron has p% to dropout The structured of the network is changed. Using the new network for training For each iteration, we resample the dropout neurons
66 Dropout Testing: No dropout If the dropout rate at training is p%, all the weights times (-p)% Assume that the dropout rate is 50%. l l If w ij = from training, set w ij = 0.5 for testing.
67 Dropout - Intuitive Reason 我的 partner 會擺爛, 所以我要好好做 When teams up, if everyone epect the partner will do the work, nothing will be done finally. However, if you know your partner will dropout, you will do better. When testing, no one dropout actually, so obtaining good results eventually.
68 Dropout - Intuitive Reason Why the weights should multiply (-p)% (dropout rate) when testing? Training of Dropout Testing of Dropout Assume dropout rate is 50% w w w 3 w 4 z No dropout w w w 3 w 4 Weights from training z z z Weights multiply (-p)% z z
69 Dropout - Ensemble Ensemble Training Set Set Set Set 3 Set 4 Network Network Network 3 Network 4 Train a bunch of networks with different structures
70 Dropout - Ensemble Ensemble Testing data Network Network Network 3 Network 4 y y y 3 y 4 average
71 Dropout - Ensemble Dropout Ensemble. Training of Dropout 3 4 M neurons M possible networks Using one data to train one network Some parameters in the network are shared
72 Dropout - Ensemble Testing of Dropout testing data Dropout Ensemble. All the weights multiply (-p)% y y y 3 average? y
73 Dropout - Ensemble Eperiments on hand writing digital classification Ref:
74 Practical Suggestion for Dropout arger network If you know your task need n neurons, for dropout rate p, your network need n/(-p) neurons. onger training time Higher learning rate arger momentum
75 Not the whole story To learn more about the theory of dropout Baldi, Pierre, and Peter J. Sadowski. "Understanding dropout." Advances in Neural Information Processing Systems. 03. DropConnect Reference: Wan,., Zeiler, M., Zhang, S., Cun, Y.., & Fergus, R. (03). Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine earning (ICM-3) Standout Reference: Ba, J., & Frey, B. (03). Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems
76 Concluding Remarks Outlook: How to connect neurons Parameters Initialization Activation Function Cost Function Data Preprocessing Training Generalization ReU, Maout Softma + Cross Entropy Mean=0, Variance= Momentum, NAG, Adagrad, RMSProp Weight decay, Dropout
77 Acknowledgement 感謝賴威昇同學糾正投影片上的錯誤
78 Appendi
79 Optimizaton method ahiriprochnowng0.pdf For RNN kever3.pdf
80 Kludge? The smallness of the weights means that the behaviour of the network won't change too much if we change a few random inputs here and there. That makes it difficult for a regularized network to learn the effects of local noise in the data. Simple is not always better If one had judged merely on the grounds of simplicity, then some modified form of Newton's theory would arguably have been more attractive. the dynamics of gradient descent learning in multilayer nets has a `self-regularization' effect"
81 Use genetic approach
82 Story of NN ents/5lnbt/ama_yann_lecun/chivdv7
83 Reference Everything about tuing parameters
84 Activation Functions a Softplus: a ln e z a ma a e z Softsign: a a e e z z e e z z z
85 RProp Each parameter has different learning rate θ i θ i η i C θ i If at t- and t-th iterations, C θ i has the same sign Otherwise: - η i : parameter dependent η i η i 0.5 η i η i. (decrease) If C θ i > 0 If C θ i < 0 (increase)
86 RProp Original Gradient descent RProp 0.0 ~ 0.3k updates updates
87 Adadelta
88 MNIST TIMIT
89 Error Surface: Square Error v.s. Cross Entropy Red: Square Error Black: Cross Entropy Cost w w Ref: Glorot, X., & Bengio, Y. (00). Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics
90 Reference Mean Variance Decorrelation
91 ReU Backward Pass: l δ ayer l l z l z ayer l + δ l z z l l ayer - - z z ayer z z C C y C y y i l z i l W T k l z k m z m W T n z n C y n
92 Start at point θ 0 Momentum v 0 =0 (Movement at the last step) Compute gradient at θ 0 Momentum v = λv 0 - μ C θ 0 Move to θ = θ 0 + v Compute gradient at θ Momentum v = λv - μ C θ First make a big jump in the direction of the previous accumulated gradient. Move to θ = θ + v Then measure the gradient where you end up and make a correction. Its better to correct a mistake after you have made it! Start at point θ 0 Momentum v 0 =0 (Movement at the last step) Compute gradient at θ 0 Momentum v = λv 0 - μ C θ 0 + λv 0 Move to θ = θ 0 + v Compute gradient at θ Momentum v = λv - μ C θ + λv Move to θ = θ + v Compute gradient at θ Momentum v 3 = λv - μ C θ + λv Move to θ 3 = θ + v 3
93 Start at point θ 0 Momentum v 0 =0 (Movement at the last step) Compute gradient at θ 0 Momentum v = λv 0 - μ C θ 0 + λv 0 Move to θ = θ 0 + v Compute gradient at θ Momentum v = λv - μ C θ + λv Move to θ = θ + v Compute gradient at θ Momentum v 3 = λv - μ C θ + λv Move to θ 3 = θ + v 3 Start at point θ 0 Momentum v 0 =0 (Movement at the last step) Momentum v =0- μ C θ 0 Move to θ = θ 0 + v Move to θ = θ + λv Momentum v = λv - μ C θ Move to θ = θ + v Momentum v 3 = λv - μ C θ + Move to θ 3 = θ + v 3
94 Why Zero Mean? w C w j j C z l i the same for all j Input w ayer The inputs are always positive j w j i w w Sigmoid <? Hyperbolic Tangent Initial: target:
Tips for Deep Learning
Tips for Deep Learning Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results
More informationTips for Deep Learning
Tips for Deep Learning Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results
More informationDeep Learning Tutorial. 李宏毅 Hung-yi Lee
Deep Learning Tutorial 李宏毅 Hung-yi Lee Outline Part I: Introduction of Deep Learning Part II: Why Deep? Part III: Tips for Training Deep Neural Network Part IV: Neural Network with Memory Part I: Introduction
More informationBased on the original slides of Hung-yi Lee
Based on the original slides of Hung-yi Lee New Activation Function Rectified Linear Unit (ReLU) σ z a a = z Reason: 1. Fast to compute 2. Biological reason a = 0 [Xavier Glorot, AISTATS 11] [Andrew L.
More informationDeep Learning. Hung-yi Lee 李宏毅
Deep Learning Hung-yi Lee 李宏毅 Deep learning attracts lots of attention. I believe you have seen lots of exciting results before. Deep learning trends at Google. Source: SIGMOD 206/Jeff Dean 958: Perceptron
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationDeep learning attracts lots of attention.
Deep Learning Deep learning attracts lots of attention. I believe you have seen lots of exciting results before. Deep learning trends at Google. Source: SIGMOD/Jeff Dean Ups and downs of Deep Learning
More informationDeep Learning & Artificial Intelligence WS 2018/2019
Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target
More informationDeep Learning Tutorial. 李宏毅 Hung-yi Lee
Deep Learning Tutorial 李宏毅 Hung-yi Lee Deep learning attracts lots of attention. Google Trends Deep learning obtains many exciting results. The talks in this afternoon This talk will focus on the technical
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationBased on the original slides of Hung-yi Lee
Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services
More informationCSC 578 Neural Networks and Deep Learning
CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross
More informationOptimization for Training I. First-Order Methods Training algorithm
Optimization for Training I First-Order Methods Training algorithm 2 OPTIMIZATION METHODS Topics: Types of optimization methods. Practical optimization methods breakdown into two categories: 1. First-order
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationNeural Networks: Optimization & Regularization
Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg
More informationStatistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks
Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered
More informationOverview of gradient descent optimization algorithms. HYUNG IL KOO Based on
Overview of gradient descent optimization algorithms HYUNG IL KOO Based on http://sebastianruder.com/optimizing-gradient-descent/ Problem Statement Machine Learning Optimization Problem Training samples:
More informationEve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates
Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive
More informationArtificial Neuron (Perceptron)
9/6/208 Gradient Descent (GD) Hantao Zhang Deep Learning with Python Reading: https://en.wikipedia.org/wiki/gradient_descent Artificial Neuron (Perceptron) = w T = w 0 0 + + w 2 2 + + w d d where
More informationRegression. Hung-yi Lee 李宏毅
Regression Hung-yi Lee 李宏毅 Regression: Output a scalar Stock Market Forecast f Self-driving Car = Dow Jones Industrial Average at tomorrow f = 方向盤角度 Recommendation f = 使用者 A 商品 B 購買可能性 Example Application
More informationNeural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016
Neural Networks Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016 Outline Part 1 Introduction Feedforward Neural Networks Stochastic Gradient Descent Computational Graph
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationLecture 5: Logistic Regression. Neural Networks
Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationMachine Learning Basics III
Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient
More informationDeep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation
Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks
More informationCSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!
CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!! November 18, 2015 THE EXAM IS CLOSED BOOK. Once the exam has started, SORRY, NO TALKING!!! No, you can t even say see ya
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationMachine Learning Lecture 14
Machine Learning Lecture 14 Tricks of the Trade 07.12.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory Probability
More informationIntro to Neural Networks and Deep Learning
Intro to Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi UVA CS 6316 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions Backpropagation Nonlinearity Functions NNs
More informationUnderstanding Neural Networks : Part I
TensorFlow Workshop 2018 Understanding Neural Networks Part I : Artificial Neurons and Network Optimization Nick Winovich Department of Mathematics Purdue University July 2018 Outline 1 Neural Networks
More informationStatistical Machine Learning from Data
January 17, 2006 Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Multi-Layer Perceptrons Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole
More informationNeural Networks. Nicholas Ruozzi University of Texas at Dallas
Neural Networks Nicholas Ruozzi University of Texas at Dallas Handwritten Digit Recognition Given a collection of handwritten digits and their corresponding labels, we d like to be able to correctly classify
More informationSerious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions
BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design
More informationAdvanced Training Techniques. Prajit Ramachandran
Advanced Training Techniques Prajit Ramachandran Outline Optimization Regularization Initialization Optimization Optimization Outline Gradient Descent Momentum RMSProp Adam Distributed SGD Gradient Noise
More informationDeepLearning on FPGAs
DeepLearning on FPGAs Introduction to Deep Learning Sebastian Buschäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 21, 2017 1 Recap Computer Science Approach Technical
More informationArtificial Neural Networks
Artificial Neural Networks Oliver Schulte - CMPT 310 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will focus on
More informationDEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY
DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo
More informationIntroduction to Neural Networks
CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character
More informationMachine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6
Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)
More informationConvolutional Neural Network. Hung-yi Lee
al Neural Network Hung-yi Lee Why CNN for Image? [Zeiler, M. D., ECCV 2014] x 1 x 2 Represented as pixels x N The most basic classifiers Use 1 st layer as module to build classifiers Use 2 nd layer as
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationIntroduction to Convolutional Neural Networks (CNNs)
Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei
More informationDeep Reinforcement Learning. Scratching the surface
Deep Reinforcement Learning Scratching the surface Deep Reinforcement Learning Scenario of Reinforcement Learning Observation State Agent Action Change the environment Don t do that Reward Environment
More informationLecture 6 Optimization for Deep Neural Networks
Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017 Things we will look at today Stochastic Gradient Descent Things
More informationCSCI567 Machine Learning (Fall 2018)
CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo U of Southern California Sep 12, 2018 September 12, 2018 1 / 49 Administration GitHub repos are setup (ask TA Chi Zhang for any issues) HW 1 is due
More informationTraining Neural Networks Practical Issues
Training Neural Networks Practical Issues M. Soleymani Sharif University of Technology Fall 2017 Most slides have been adapted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017, and some from
More informationComments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms
Neural networks Comments Assignment 3 code released implement classification algorithms use kernels for census dataset Thought questions 3 due this week Mini-project: hopefully you have started 2 Example:
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More information4. Multilayer Perceptrons
4. Multilayer Perceptrons This is a supervised error-correction learning algorithm. 1 4.1 Introduction A multilayer feedforward network consists of an input layer, one or more hidden layers, and an output
More informationDeep Feedforward Networks
Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3
More informationLecture 17: Neural Networks and Deep Learning
UVA CS 6316 / CS 4501-004 Machine Learning Fall 2016 Lecture 17: Neural Networks and Deep Learning Jack Lanchantin Dr. Yanjun Qi 1 Neurons 1-Layer Neural Network Multi-layer Neural Network Loss Functions
More informationIntroduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen
Neural Networks - I Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Neural Networks 1 /
More informationEVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN)
EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN) TARGETED PIECES OF KNOWLEDGE Linear regression Activation function Multi-Layers Perceptron (MLP) Stochastic Gradient Descent
More informationInternal Covariate Shift Batch Normalization Implementation Experiments. Batch Normalization. Devin Willmott. University of Kentucky.
Batch Normalization Devin Willmott University of Kentucky October 23, 2017 Overview 1 Internal Covariate Shift 2 Batch Normalization 3 Implementation 4 Experiments Covariate Shift Suppose we have two distributions,
More informationHow to do backpropagation in a brain
How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep
More informationDeep Learning II: Momentum & Adaptive Step Size
Deep Learning II: Momentum & Adaptive Step Size CS 760: Machine Learning Spring 2018 Mark Craven and David Page www.biostat.wisc.edu/~craven/cs760 1 Goals for the Lecture You should understand the following
More informationNeural Networks biological neuron artificial neuron 1
Neural Networks biological neuron artificial neuron 1 A two-layer neural network Output layer (activation represents classification) Weighted connections Hidden layer ( internal representation ) Input
More informationNeural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann
Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable
More informationA summary of Deep Learning without Poor Local Minima
A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given
More informationFeed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.
Neural Networks Oliver Schulte - CMPT 726 Bishop PRML Ch. 5 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will
More informationCS260: Machine Learning Algorithms
CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {
More informationMachine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group
Machine Learning for Computer Vision 8. Neural Networks and Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group INTRODUCTION Nonlinear Coordinate Transformation http://cs.stanford.edu/people/karpathy/convnetjs/
More informationStochastic Gradient Descent
Stochastic Gradient Descent Weihang Chen, Xingchen Chen, Jinxiu Liang, Cheng Xu, Zehao Chen and Donglin He March 26, 2017 Outline What is Stochastic Gradient Descent Comparison between BGD and SGD Analysis
More informationComputational statistics
Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationMultilayer Perceptrons (MLPs)
CSE 5526: Introduction to Neural Networks Multilayer Perceptrons (MLPs) 1 Motivation Multilayer networks are more powerful than singlelayer nets Example: XOR problem x 2 1 AND x o x 1 x 2 +1-1 o x x 1-1
More informationArtifical Neural Networks
Neural Networks Artifical Neural Networks Neural Networks Biological Neural Networks.................................. Artificial Neural Networks................................... 3 ANN Structure...........................................
More informationIMPROVING STOCHASTIC GRADIENT DESCENT
IMPROVING STOCHASTIC GRADIENT DESCENT WITH FEEDBACK Jayanth Koushik & Hiroaki Hayashi Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {jkoushik,hiroakih}@cs.cmu.edu
More informationLecture 7 Artificial neural networks: Supervised learning
Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in
More informationTutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba
Tutorial on: Optimization I (from a deep learning perspective) Jimmy Ba Outline Random search v.s. gradient descent Finding better search directions Design white-box optimization methods to improve computation
More informationCourse 395: Machine Learning - Lectures
Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture
More informationLecture 3: Deep Learning Optimizations
Lecture 3: Deep Learning Optimizations Deep Learning @ UvA UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES DEEP LEARNING OPTIMIZATIONS - 1 Lecture overview o How to define our model and optimize it in practice
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationOptimization for neural networks
0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make
More informationLecture 10. Neural networks and optimization. Machine Learning and Data Mining November Nando de Freitas UBC. Nonlinear Supervised Learning
Lecture 0 Neural networks and optimization Machine Learning and Data Mining November 2009 UBC Gradient Searching for a good solution can be interpreted as looking for a minimum of some error (loss) function
More informationRapid Introduction to Machine Learning/ Deep Learning
Rapid Introduction to Machine Learning/ Deep Learning Hyeong In Choi Seoul National University 1/59 Lecture 4a Feedforward neural network October 30, 2015 2/59 Table of contents 1 1. Objectives of Lecture
More informationCSC321 Lecture 8: Optimization
CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationRegularization and Optimization of Backpropagation
Regularization and Optimization of Backpropagation The Norwegian University of Science and Technology (NTNU) Trondheim, Norway keithd@idi.ntnu.no October 17, 2017 Regularization Definition of Regularization
More informationSemi-supervised Learning
Semi-supervised Learning Introduction Supervised learning: x r, y r R r=1 E.g.x r : image, y r : class labels Semi-supervised learning: x r, y r r=1 R, x u R+U u=r A set of unlabeled data, usually U >>
More informationSlide credit from Hung-Yi Lee & Richard Socher
Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step
More informationMachine Learning. Neural Networks. (slides from Domingos, Pardo, others)
Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward
More informationArchitecture Multilayer Perceptron (MLP)
Architecture Multilayer Perceptron (MLP) 1 Output Hidden Input Neurons partitioned into layers; y 1 y 2 one input layer, one output layer, possibly several hidden layers layers numbered from 0; the input
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,
More informationSPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks
Topics in Machine Learning-EE 5359 Neural Networks 1 The Perceptron Output: A perceptron is a function that maps D-dimensional vectors to real numbers. For notational convenience, we add a zero-th dimension
More informationSimple neuron model Components of simple neuron
Outline 1. Simple neuron model 2. Components of artificial neural networks 3. Common activation functions 4. MATLAB representation of neural network. Single neuron model Simple neuron model Components
More informationNeural Network Tutorial & Application in Nuclear Physics. Weiguang Jiang ( 蒋炜光 ) UTK / ORNL
Neural Network Tutorial & Application in Nuclear Physics Weiguang Jiang ( 蒋炜光 ) UTK / ORNL Machine Learning Logistic Regression Gaussian Processes Neural Network Support vector machine Random Forest Genetic
More informationLecture 4: Perceptrons and Multilayer Perceptrons
Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons
More informationDeep Feedforward Networks. Seung-Hoon Na Chonbuk National University
Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections
More informationNeural Networks (Part 1) Goals for the lecture
Neural Networks (Part ) Mark Craven and David Page Computer Sciences 760 Spring 208 www.biostat.wisc.edu/~craven/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed
More informationMachine Learning
Machine Learning 10-315 Maria Florina Balcan Machine Learning Department Carnegie Mellon University 03/29/2019 Today: Artificial neural networks Backpropagation Reading: Mitchell: Chapter 4 Bishop: Chapter
More informationEPL442: Computational
EPL442: Computational Learning Systems Lab 2 Vassilis Vassiliades Department of Computer Science University of Cyprus Outline Artificial Neuron Feedforward Neural Network Back-propagation Algorithm Notes
More informationAI Programming CS F-20 Neural Networks
AI Programming CS662-2008F-20 Neural Networks David Galles Department of Computer Science University of San Francisco 20-0: Symbolic AI Most of this class has been focused on Symbolic AI Focus or symbols
More informationNeural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications
Neural Networks Bishop PRML Ch. 5 Alireza Ghane Neural Networks Alireza Ghane / Greg Mori 1 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of
More informationArtificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino
Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationWhat Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1
What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 Multi-layer networks Steve Renals Machine Learning Practical MLP Lecture 3 7 October 2015 MLP Lecture 3 Multi-layer networks 2 What Do Single
More information<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)
Learning for Deep Neural Networks (Back-propagation) Outline Summary of Previous Standford Lecture Universal Approximation Theorem Inference vs Training Gradient Descent Back-Propagation
More informationDeep Learning Lab Course 2017 (Deep Learning Practical)
Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of
More informationNeural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron
More information