Understanding Neural Networks : Part I
1 TensorFlow Workshop 2018
Understanding Neural Networks, Part I: Artificial Neurons and Network Optimization
Nick Winovich, Department of Mathematics, Purdue University, July 2018
2 Outline
1. Neural Networks
   - Artificial Neurons and Hidden Layers
   - Universal Approximation Theorem
   - Regularization and Batch Norm
2. Network Optimization
   - Evaluating Network Performance
   - Stochastic Gradient Descent Algorithms
   - Backprop and Automatic Differentiation
5 Artificial Neural Networks
Neural networks are a class of simple, yet effective, computing systems with a diverse range of applications. In these systems, small computational units, or nodes, are arranged to form networks in which connectivity is leveraged to carry out complex calculations.
Further reading:
- Deep Learning by Goodfellow, Bengio, and Courville
- Convolutional Neural Networks for Visual Recognition (Stanford)
6 Artificial Neurons
[Diagram modified from a Stack Exchange post answered by Gonzalo Medina.]
Weights are first used to scale inputs; the results are summed with a bias term and passed through an activation function.
7 Formula and Vector Representation
The diagram from the previous slide can be interpreted as:
$$y = f(x_1 w_1 + x_2 w_2 + x_3 w_3 + b)$$
which can be conveniently represented in vector form via:
$$y = f(w^T x + b)$$
by interpreting the neuron inputs and weights as column vectors.
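The vector form above can be sketched in a few lines of plain Python. This is a minimal illustration only; the function name `neuron` and the sample weights, inputs, and bias are assumptions chosen for the example, not values from the slides.

```python
import math

def neuron(x, w, b, f):
    """Single artificial neuron: y = f(w^T x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs, weights, and bias for illustration.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.75

y_linear = neuron(x, w, b, lambda z: z)  # identity activation: w^T x + b
y_sigmoid = neuron(x, w, b, sigmoid)     # same pre-activation, sigmoid applied
```

Here the weighted sum is 0.5 - 2.0 + 0.75 + 0.75 = 0, so the sigmoid output is 0.5.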
8 Artificial Neurons: Multiple Outputs
9 Matrix Representation
This corresponds to a pair of equations, one for each output:
$$y_1 = f(w_1^T x + b_1)$$
$$y_2 = f(w_2^T x + b_2)$$
which can be represented in matrix form by the system:
$$y = f(W x + b)$$
where the rows of $W$ are $w_1^T$ and $w_2^T$, and we assume the activation function has been vectorized (i.e. applied elementwise).
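As a rough sketch of the matrix form, each row of W produces one output. The helper name `dense` and the numbers below are assumptions for illustration; plain lists are used to keep the example dependency-free.

```python
def dense(W, x, b, f):
    """Fully-connected layer: y = f(W x + b), with f applied elementwise."""
    return [f(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def relu(z):
    return max(0.0, z)

# Hypothetical layer with N = 3 inputs and M = 2 outputs.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
x = [2.0, 4.0, 6.0]
b = [1.0, -1.0]

y = dense(W, x, b, relu)  # pre-activations are -3.0 and 5.0
```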
10 Fully-Connected Neural Layers The resulting layers, referred to as fully-connected or dense, can be visualized as a collection of nodes connected by edges corresponding to weights (bias/activations are typically omitted)
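11 Floating Point Operation Count
Matrix-Vector Multiplication:
$$W x = \begin{bmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & & \vdots \\ w_{M1} & \cdots & w_{MN} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} w_{11} x_1 + \cdots + w_{1N} x_N \\ \vdots \\ w_{M1} x_1 + \cdots + w_{MN} x_N \end{bmatrix}$$
Mult: $MN$ (one product $w_{ij} x_j$ per matrix entry)
Add: $M(N-1)$ (summing the $N$ products in each of the $M$ rows)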
12 Floating Point Operation Count
So we see that when bias terms are omitted, the number of FLOPs required for a neural connection between $N$ inputs and $M$ outputs is:
$$2MN - M = MN \text{ multiplies} + M(N-1) \text{ adds}$$
When bias terms are included, an additional $M$ addition operations are required, resulting in a total of $2MN$ FLOPs.
Note: This omits the computation required for applying the activation function to the $M$ values resulting from the linear operations. Depending on the activation function selected, this may or may not have a significant impact on the overall computational complexity.
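The count above is easy to check numerically. The following sketch assumes a hypothetical helper `dense_flops` and an arbitrary layer size (784 inputs, 100 outputs) picked only for illustration.

```python
def dense_flops(n_inputs, n_outputs, use_bias=True):
    """FLOPs for y = W x (+ b), excluding the activation function."""
    multiplies = n_outputs * n_inputs
    adds = n_outputs * (n_inputs - 1)
    if use_bias:
        adds += n_outputs  # one extra addition per output
    return multiplies + adds

without_bias = dense_flops(784, 100, use_bias=False)  # 2MN - M
with_bias = dense_flops(784, 100, use_bias=True)      # 2MN
```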
13 Activation Functions
Activation functions are a fundamental component of neural network architectures; these functions are responsible for:
- Providing all of the network's non-linear modeling capacity
- Controlling the gradient flows that guide the training process
While activation functions play a fundamental role in all neural networks, it is still desirable to limit their computational demands (e.g. avoid defining them in terms of a Krylov subspace method...). In practice, activations such as rectified linear units (ReLUs), which have the most trivial function and derivative definitions, often suffice.
14 Activation Functions
Rectified Linear Unit (ReLU):
$$f(x) = \begin{cases} x & x \ge 0 \\ 0 & x < 0 \end{cases}$$
SoftPlus Activation:
$$f(x) = \ln\big(1 + \exp(x)\big)$$
15 Activation Functions
Sigmoidal Unit:
$$f(x) = \frac{1}{1 + \exp(-x)}$$
Hyperbolic Tangent Unit:
$$f(x) = \tanh(x)$$
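16 Activation Functions (Parameterized)
Exponential Linear Unit (ELU):
$$f_\alpha(x) = \begin{cases} x & x \ge 0 \\ \alpha\,(e^x - 1) & x < 0 \end{cases}$$
Leaky Rectified Linear Unit:
$$f_\alpha(x) = \begin{cases} x & x \ge 0 \\ \alpha\, x & x < 0 \end{cases}$$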
17 Activation Functions (Learnable Parameters)
Parameterized ReLU:
$$f_\beta(x) = \begin{cases} x & x \ge 0 \\ \beta\, x & x < 0 \end{cases}$$
Swish Units:
$$f_\beta(x) = \frac{x}{1 + \exp(-\beta x)}$$
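The activation functions from the preceding slides can be written down directly as scalar functions. This is a naive sketch: the default values of `alpha` and `beta` are assumptions for illustration, and no care is taken to avoid overflow in `math.exp` for very large arguments.

```python
import math

def relu(x):
    return x if x >= 0 else 0.0

def softplus(x):
    return math.log1p(math.exp(x))  # ln(1 + exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x

def prelu(x, beta):
    # Same shape as the leaky ReLU, but beta is treated as a learnable parameter.
    return x if x >= 0 else beta * x

def swish(x, beta=1.0):
    return x / (1.0 + math.exp(-beta * x))
```

For a vectorized layer, each function would simply be applied elementwise to the pre-activation values.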
18 Hidden Layers
Intermediate, or hidden, layers can be added between the input and output nodes to allow for additional non-linear processing. For example, we can first define a layer such as:
$$h = f_1(W_1 x + b_1)$$
and construct a subsequent layer to produce the final output:
$$y = f_2(W_2 h + b_2)$$
19 Hidden Layers
20 Multiple Hidden Layers
21 Multiple Hidden Layers
Multiple hidden layers can easily be defined in the same way:
$$h_1 = f_1(W_1 x + b_1)$$
$$h_2 = f_2(W_2 h_1 + b_2)$$
$$y = f_3(W_3 h_2 + b_3)$$
One of the challenges of working with additional layers is the need to determine the impact that earlier layers have on the final output. This will be necessary for tuning/optimizing network parameters (i.e. weights and biases) to produce accurate predictions.
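A forward pass through several layers is just repeated application of the dense-layer formula. The sketch below assumes hypothetical weights and biases and uses ReLU for the hidden layers with an identity output activation; none of these choices come from the slides.

```python
def dense(W, x, b, f):
    """One layer: f(W x + b), with f applied elementwise."""
    return [f(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def forward(x, layers):
    """Apply each (W, b, f) triple in order: h1, h2, ..., y."""
    h = x
    for W, b, f in layers:
        h = dense(W, h, b, f)
    return h

# Hypothetical 2 -> 3 -> 2 -> 1 network.
layers = [
    ([[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]], [0.0, 0.0, 1.0], relu),
    ([[1.0, 1.0, 1.0], [0.0, -1.0, 0.0]], [0.0, 0.5], relu),
    ([[0.5, 3.0]], [0.25], identity),
]

y = forward([1.0, -1.0], layers)  # h1 = [0, 2, 0], h2 = [2, 0]
```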
23 Universal Approximators: Cybenko (1989)
Cybenko, G., Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4).
Basic Idea of Result: Let $I_n$ denote the unit hypercube in $\mathbb{R}^n$; the collection of functions which can be expressed in the form:
$$\sum_{i=1}^{N} \alpha_i \, \sigma(w_i^T x + b_i), \qquad x \in I_n$$
is dense in the space of continuous functions $C(I_n)$ defined on $I_n$: i.e. for all $f \in C(I_n)$ and $\varepsilon > 0$ there exist constants $N, \alpha_i, w_i, b_i$ such that
$$\left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^T x + b_i) \right| < \varepsilon \qquad \forall\, x \in I_n$$
24 Universal Approximators: Hornik et al. / Funahashi
Hornik, K., Stinchcombe, M. and White, H., Multilayer feedforward networks are universal approximators. Neural networks, 2(5).
Funahashi, K.I., On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3).
Summary of Results: For any compact set $K \subset \mathbb{R}^n$, multi-layer feedforward neural networks are dense in the space of continuous functions $C(K)$ on $K$, with respect to the supremum norm, provided that the activation function used for the network layers is:
- Continuous and increasing
- Non-constant and bounded
25 Universal Approximators: Leshno et al. (1992)
Leshno, M., Lin, V.Y., Pinkus, A. and Schocken, S., Multilayer feedforward networks with a non-polynomial activation function can approximate any function.
"A standard multilayer feedforward network with a locally bounded piecewise continuous activation function can approximate any continuous function to any degree of accuracy if and only if the network's activation function is not a polynomial." (Leshno et al.)
- Here the notion of approximation is also defined in terms of the supremum norm, and the domains are assumed to be compact
- The result does not hold without thresholds (i.e. bias terms)
27 Overfitting
In some cases the network is capable of learning too much from the specific training data used; this phenomenon is referred to as overfitting and occurs when the model performs well on the training dataset, but does not generalize to accurate predictions on data which has not been seen during training. Consider, for example:
28 L1 and L2 Weight Regularization
One simple technique to help avoid overfitting is to add a penalty for network parameters with large L1 or L2 norms. This is similar to the underlying idea behind LASSO regression and can be loosely interpreted as a form of applying the principle of Ockham's Razor: i.e. the simplest solution often turns out to be the correct solution.
- L2 regularization is a fairly general regularization technique which places an emphasis on reducing the largest weights
- L1 regularization helps to encourage sparsity in the network and improves performance when the problem has a sparse solution
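In code, weight regularization amounts to adding penalty terms to the primary loss. The sketch below assumes the common convention of penalizing the squared L2 norm, and the coefficient names `l1` and `l2` and the sample weights are hypothetical.

```python
def l1_penalty(weights):
    """Sum of absolute values; encourages sparse weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squares (squared L2 norm); dominated by the largest weights."""
    return sum(w * w for w in weights)

def regularized_loss(primary_loss, weights, l1=0.0, l2=0.0):
    return primary_loss + l1 * l1_penalty(weights) + l2 * l2_penalty(weights)

weights = [3.0, -0.5, 0.0, 0.25]
total = regularized_loss(1.0, weights, l1=0.1, l2=0.01)
```

Note how the single weight 3.0 contributes 9.0 of the 9.3125 total to the L2 penalty, which is why L2 regularization mainly shrinks the largest weights.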
29 Applying Dropout
Applying dropout to hidden network layers also helps to avoid overfitting. This technique consists of removing, or dropping, units/nodes randomly at each step of the training process.
- A fixed drop rate $p \in (0, 1)$ is specified prior to training, and nodes in the layer are dropped according to a collection of i.i.d. random Bernoulli samples drawn at each training step
- Since all nodes will be used after training, the outputs of the remaining nodes are rescaled by a factor of $1/(1 - p)$ to ensure that the expected values during training and testing coincide
Loosely speaking, this can be thought of as a way to ensure that no individual node plays too large of a role in the final prediction.
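The two bullet points above translate into a short routine. This is a minimal sketch with a hypothetical function name `dropout`; a seeded generator is used only so the example is repeatable.

```python
import random

def dropout(values, rate, training, rng=random):
    """Drop each value with probability `rate` and rescale survivors by 1/(1 - rate)."""
    if not training:
        return list(values)  # all nodes are used at test time
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
h = [1.0] * 10000
train_out = dropout(h, 0.25, training=True, rng=rng)
test_out = dropout(h, 0.25, training=False)

# The rescaling keeps the expected value unchanged during training.
mean_train = sum(train_out) / len(train_out)
```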
30 Example: Dropout with Rate = 0.25 [ Training ]
33 Example: Dropout with Rate = 0.25 [ Testing ]
34 Motivation for Batch Normalization
Ioffe, S. and Szegedy, C., Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint.
Internal Covariate Shift: As network parameters change during training, the distributions of the input values to each layer change.
- Training could be more efficient if the layers were receiving inputs with a fixed distribution throughout the entire process
- Achieving this using normalization requires a technique which is compatible with gradient-based optimization
35 Batch Normalization
The proposed batch normalization technique corresponds to first performing a normalization with respect to the batch statistics:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} \qquad \text{with} \qquad \mu_B = \frac{1}{m} \sum_{x \in B} x, \qquad \sigma_B^2 = \frac{1}{m} \sum_{x \in B} (x - \mu_B)^2$$
where $m$ is a fixed batch size, and $\varepsilon > 0$ for numerical stability. A linear map with learnable parameters $\gamma$ and $\beta$ is then applied:
$$y_i = \gamma \, \hat{x}_i + \beta$$
and the normalized values $\{y_i\}$ are passed to the activation function to apply the non-linear transformation for the layer.
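The training-time computation above can be sketched for a single scalar feature as follows. The function name `batch_norm`, the value of `eps`, and the sample batch are assumptions for illustration.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize a batch of scalars with its own statistics, then apply gamma and beta."""
    m = len(batch)
    mu = sum(batch) / m
    var = sum((x - mu) ** 2 for x in batch) / m
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    y = [gamma * xh + beta for xh in x_hat]
    return y, mu, var

y, mu, var = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=3.0)
```

After the linear map, the outputs have mean close to beta and standard deviation close to gamma, which is what lets the network undo the normalization if that turns out to be useful.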
36 Batch Normalization after Training
After training, we need a way to freeze the model in place for making predictions. This is accomplished by specifying a fixed normalization rule for each layer; rather than use sample statistics from a specific batch, it is natural to incorporate the entire dataset:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$
where $\mu$ is the empirical mean $E_D[x]$ and $\sigma^2$ is the variance $\mathrm{Var}_D[x]$ taken with respect to the complete training dataset $D$.
These values can be tracked using moving averages during training to avoid direct computation and provide accurate estimates when parameter changes are small near the end of the training process.
39 Evaluating a Network
So far, we have not discussed how to quantify the performance of a neural network. The most common strategy is to define loss functions for the network, with low loss corresponding to high performance.
In supervised training, where the true labels/solutions are known, the network loss function is typically composed of:
- A primary loss term corresponding to a measure of how close the network predictions are to the true solutions
- Auxiliary loss components, such as weight regularization penalties, designed to help guide the training process
Once the network performance is quantified, we can specify an optimization algorithm designed to minimize the network loss.
40 Loss Functions
Two of the most common applications of neural networks are regression (for predicting continuous properties/values) and classification (for predicting discrete properties or labels).
For regression, a standard loss is given by the mean squared error:
$$\text{Loss} = \frac{1}{|I|} \sum_{i \in I} (\hat{y}_i - y_i)^2$$
where $I$ is the set of indices of the output data (e.g. pixels of an image).
For classification, softmax cross entropy can be used when labels are mutually exclusive (e.g. classifying a digit as 0, 1, 2 etc.); sigmoid cross entropy can be used when labels are not mutually exclusive (e.g. determining which objects are in an image).
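The three losses mentioned above can be sketched as follows. The slide only names the cross-entropy losses, so the formulas used here are the standard definitions rather than anything stated in the source; summing (rather than averaging) the sigmoid cross entropy over labels is an assumption.

```python
import math

def mse(pred, target):
    """Mean squared error over the output indices."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, one_hot):
    """For mutually exclusive labels: -sum_k t_k log p_k."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sigmoid_cross_entropy(logits, labels):
    """For independent labels: one binary cross entropy term per label."""
    total = 0.0
    for z, t in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

reg_loss = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
cls_loss = softmax_cross_entropy([0.0, 0.0, 0.0], [0.0, 1.0, 0.0])
multi_loss = sigmoid_cross_entropy([0.0, 0.0], [1.0, 0.0])
```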
41 One Hot Encoding
It is also fundamentally important to consider how the data will be represented within the network. When classifying digits, for example, networks will typically perform extremely poorly if the labels are represented as a single number: 0.0, 1.0, 2.0, etc.
To better distinguish the differences between e.g. 0, 1, and 2, it is useful to instead store the values using a one hot encoding:
0 = [ 1 0 0 0 0 0 0 0 0 0 ]
1 = [ 0 1 0 0 0 0 0 0 0 0 ]
2 = [ 0 0 1 0 0 0 0 0 0 0 ]
One hot encodings are also typically used for word prediction (by specifying a dictionary of possibilities) and character level predictions (by specifying the admissible character set).
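A one hot encoding is straightforward to construct. The helper name `one_hot` is hypothetical, and ten classes are assumed to match the digit example.

```python
def one_hot(label, num_classes):
    """Return a vector with a single 1.0 at position `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must lie in [0, num_classes)")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

encoded_two = one_hot(2, 10)
```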
42 Data Preparation
In general, it is also good practice to first process/prepare the input values of a dataset before training. For example, if the input values are centered around 100 and all lie within the interval [99.9, 100.1], it is typically better to center and rescale these values beforehand:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} \qquad \text{where} \qquad \mu = \frac{1}{|D|} \sum_{x \in D} x \qquad \text{and} \qquad \sigma^2 = \frac{1}{|D|} \sum_{x \in D} (x - \mu)^2$$
The values of $\mu$ and $\sigma^2$ can then be saved, and predictions can be made on arbitrary inputs during testing by applying the above normalization before passing the test inputs to the network.
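The preparation step can be sketched as a fit/apply pair, where the statistics are computed once on the training data and reused at test time. The function names and the small stabilizing constant `eps` are assumptions for illustration.

```python
import math

def fit_standardizer(data):
    """Compute the mean and variance of the training inputs."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mu, var

def standardize(x, mu, var, eps=1e-8):
    """Center and rescale a single input using saved statistics."""
    return (x - mu) / math.sqrt(var + eps)

train = [99.9, 100.0, 100.1]
mu, var = fit_standardizer(train)
train_scaled = [standardize(x, mu, var) for x in train]

# At test time, the saved training statistics are reused unchanged.
test_scaled = standardize(100.05, mu, var)
```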
44 Gradient Descent
Gradient descent provides a simple, iterative algorithm for finding local minima of a real-valued function $F$ numerically.
The main idea behind gradient descent is relatively straightforward: compute the gradient of the function that we want to minimize and take a step in the direction of steepest descent (i.e. $-\nabla F(x)$).
The iteration step of the algorithm is defined in terms of a step size parameter $\alpha$ (or by a decreasing sequence $\{\alpha_i\}$) by setting:
$$x_{i+1} = x_i - \alpha \nabla F(x_i)$$
Note: Convergence is only guaranteed under certain assumptions on the function $F$ (e.g. convexity, Lipschitz continuity, etc.).
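The iteration above can be sketched on a simple convex quadratic. The test function, starting point, step size, and step count are all assumptions chosen so that the iteration converges.

```python
def gradient_descent(grad, x0, step_size, num_steps):
    """Iterate x_{i+1} = x_i - step_size * grad(x_i)."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x

def grad_F(x):
    # Gradient of F(x) = (x0 - 3)^2 + 2 (x1 + 1)^2, minimized at (3, -1).
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]

x_min = gradient_descent(grad_F, [0.0, 0.0], step_size=0.1, num_steps=200)
```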
45 Loss Functions for Large Datasets
In the context of neural networks, gradient descent appears to provide a reasonable approach for tuning network parameters. The initial weights and biases can be interpreted as a single vector $\theta_0$, and the iteration steps from the previous slide could, in theory, be used to identify the optimal parameters $\theta$ for the model.
The issue with this approach is that the function we are actually trying to minimize is defined in terms of the entire dataset $D$:
$$F(\theta) = \frac{1}{|D|} \sum_{x \in D} f(x \mid \theta)$$
where $f(x \mid \theta)$ denotes the loss for a single example $x$ when using the model parameters $\theta$. So the standard algorithm would require computing the average loss, and its gradient, over the entire dataset at each step of the iterative scheme...
46 Stochastic Gradient Descent
Since computing the true gradient $\nabla F(\theta)$ at every step is impractical for large datasets, we can instead try to approximate this gradient using a smaller, more manageable mini-batch of data:
$$\nabla F(\theta) \approx \nabla F_i(\theta) = \frac{1}{|B_i|} \sum_{x \in B_i} \nabla f(x \mid \theta)$$
where the batches $\{B_i\}$ partition the dataset into smaller subsets (typically of equal size). The iteration step is then taken to be:
$$\theta_{i+1} = \theta_i - \alpha \nabla F_i(\theta_i)$$
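A minimal sketch of mini-batch SGD follows, fitting a line to noise-free synthetic data. The dataset, per-example loss (half the squared error), batch size, step size, and epoch count are assumptions for illustration; reshuffling before each epoch is a common practice that the slide does not specify.

```python
import random

def sgd(data, theta0, step_size, batch_size, num_epochs, grad_example, seed=0):
    """Mini-batch SGD: average per-example gradients over each batch B_i."""
    rng = random.Random(seed)
    theta = list(theta0)
    data = list(data)
    for _ in range(num_epochs):
        rng.shuffle(data)  # new partition of the dataset into batches
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = [0.0] * len(theta)
            for example in batch:
                g = grad_example(example, theta)
                grad = [a + gi / len(batch) for a, gi in zip(grad, g)]
            theta = [t - step_size * g for t, g in zip(theta, grad)]
    return theta

def grad_example(example, theta):
    # Gradient of f = 0.5 * (w x + b - y)^2 with respect to (w, b).
    x, y = example
    w, b = theta
    err = w * x + b - y
    return [err * x, err]

# Synthetic data from y = 2 x + 1.
data = [(k / 20.0, 2.0 * (k / 20.0) + 1.0) for k in range(20)]
theta = sgd(data, [0.0, 0.0], step_size=0.5, batch_size=5,
            num_epochs=300, grad_example=grad_example)
```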
47 Potential Obstacles
- Fixed learning rates typically lead to suboptimal performance
- Defining a learning rate schedule manually does not allow the algorithm to adapt to the particular problem under consideration
- Different parameters often require different learning rates
- Since the directions/magnitudes of previous updates are not taken into consideration, defining optimization policies on small batches of data may lead to a noisy, inefficient training process
48 Importance of Selecting the Correct Learning Rate
[Figure panels: High Learning Rate; Low Learning Rate]
49 Nesterov Momentum
Nesterov, Y.E., A method for solving the convex programming problem with convergence rate $O(1/k^2)$. In Dokl. Akad. Nauk SSSR (Vol. 269).
One method for learning from previous steps is to incorporate momentum into the update policy. This can be done by setting:
$$v_i = \gamma \, v_{i-1} + \alpha \nabla F(\theta_i)$$
$$\theta_{i+1} = \theta_i - v_i$$
An accelerated form was introduced by Nesterov in 1983 which leverages the value of looking ahead before making updates:
$$v_i = \gamma \, v_{i-1} + \alpha \nabla F(\theta_i - \gamma \, v_{i-1})$$
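Both update rules can share one routine that differs only in where the gradient is evaluated. The sketch below uses a hypothetical ill-conditioned quadratic and hand-picked values of the step size and momentum coefficient.

```python
def momentum_step(theta, v, grad, step_size, gamma, nesterov=False):
    """One momentum update; the Nesterov variant evaluates the gradient at a look-ahead point."""
    if nesterov:
        lookahead = [t - gamma * vi for t, vi in zip(theta, v)]
        g = grad(lookahead)
    else:
        g = grad(theta)
    v = [gamma * vi + step_size * gi for vi, gi in zip(v, g)]
    theta = [t - vi for t, vi in zip(theta, v)]
    return theta, v

def grad_F(theta):
    # Gradient of F = 0.5 * (theta0^2 + 10 * theta1^2), minimized at the origin.
    return [theta[0], 10.0 * theta[1]]

def run(nesterov):
    theta, v = [5.0, -3.0], [0.0, 0.0]
    for _ in range(500):
        theta, v = momentum_step(theta, v, grad_F, 0.05, 0.9, nesterov)
    return theta

theta_classical = run(nesterov=False)
theta_nesterov = run(nesterov=True)
```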
50 AdaGrad and RMSProp
Duchi, J., Hazan, E. and Singer, Y., Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul).
Hinton, G., Srivastava, N. and Swersky, K., Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent.
- AdaGrad defines parameter-specific updates which are normalized by the sum of the squares of previous gradients; this leads to a natural learning rate decay (often too much)
- RMSProp keeps moving averages of the squared gradients for each parameter which are used to rescale updates
- AdaDelta provides another method for rescaling updates, and many other variants (e.g. including momentum) exist...
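A rough sketch of single AdaGrad and RMSProp steps is given below. The slide describes these methods only in words, so the exact form used here (dividing by the square root of the accumulator, with `eps` added outside the root) is one common convention and should be read as an assumption, as are the default `decay` and `eps` values.

```python
import math

def adagrad_step(theta, accum, g, step_size, eps=1e-8):
    """Normalize each update by the root of the running SUM of squared gradients."""
    accum = [a + gi * gi for a, gi in zip(accum, g)]
    theta = [t - step_size * gi / (math.sqrt(a) + eps)
             for t, gi, a in zip(theta, g, accum)]
    return theta, accum

def rmsprop_step(theta, avg_sq, g, step_size, decay=0.9, eps=1e-8):
    """Normalize each update by the root of a MOVING AVERAGE of squared gradients."""
    avg_sq = [decay * a + (1.0 - decay) * gi * gi for a, gi in zip(avg_sq, g)]
    theta = [t - step_size * gi / (math.sqrt(a) + eps)
             for t, gi, a in zip(theta, g, avg_sq)]
    return theta, avg_sq

g = [4.0, 0.5]  # very different gradient magnitudes per parameter
theta_ada, accum = adagrad_step([1.0, 1.0], [0.0, 0.0], g, step_size=0.1)
theta_rms, avg_sq = rmsprop_step([1.0, 1.0], [0.0, 0.0], g, step_size=0.1)
```

In both cases the two parameters move by the same amount on this first step even though their gradients differ by a factor of eight, which illustrates the parameter-specific rescaling.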
51 Exponential Moving Averages
One common method for estimating an average incrementally is to keep an exponential moving average of the values. This method applies an exponential decay to terms in the average which places an emphasis on the most recent values; this allows the average to move, or correct itself, as the distribution of the values changes.
To track the gradient $g_t = \nabla F(\theta_{t-1})$ of the loss with respect to the parameters $\theta$, we can define an average recursively by setting:
$$\begin{cases} m_0 = 0 \\ m_t = \beta \, m_{t-1} + (1 - \beta) \, g_t \end{cases} \quad \Longrightarrow \quad m_t = (1 - \beta) \sum_{\tau=1}^{t} \beta^{\,t-\tau} g_\tau$$
where the parameter $\beta$ is used to specify the exponential decay rate and is typically taken to be close to, but smaller than, 1.
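The equivalence between the recursive definition and the closed-form sum can be checked numerically. The function names and the sample values are assumptions for illustration.

```python
def ema_recursive(values, beta):
    """m_0 = 0, m_t = beta * m_{t-1} + (1 - beta) * g_t."""
    m = 0.0
    for g in values:
        m = beta * m + (1.0 - beta) * g
    return m

def ema_closed_form(values, beta):
    """m_t = (1 - beta) * sum_{tau=1}^{t} beta^(t - tau) * g_tau."""
    t = len(values)
    return (1.0 - beta) * sum(beta ** (t - tau) * g
                              for tau, g in enumerate(values, start=1))

values = [1.0, 2.0, 3.0, 4.0]
m_rec = ema_recursive(values, beta=0.9)
m_sum = ema_closed_form(values, beta=0.9)
```

Both give approximately 0.9049, well below the plain mean of 2.5; this gap comes from initializing the average at zero.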
52 The Adam Optimization Algorithm
Kingma, D.P. and Ba, J., Adam: A method for stochastic optimization. arXiv preprint.
The Adam optimizer, whose name is derived from adaptive moment estimation, proposes keeping exponential moving averages of both the first moment $g_t$ and the (uncentered) second moment $g_t^2$ of the gradient. In addition, a bias correction is introduced to address the issue of arbitrarily initializing the exponential moving averages with zero.
The first moment average $m_t$ and second moment average $v_t$ are defined with decay rates $\beta_1$ and $\beta_2$, respectively, and the bias correction procedure is defined by the rescaling:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}$$
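A single Adam step can be sketched as follows. The transcription above only gives the moment averages and the bias correction; the final parameter update and the default hyperparameters used here follow the Kingma and Ba paper rather than the slide text, and the function name `adam_step` is hypothetical.

```python
import math

def adam_step(theta, m, v, g, t, step_size=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction for the zero initialization of m and v.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [th - step_size * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

theta, m, v = adam_step([1.0, 1.0], [0.0, 0.0], [0.0, 0.0],
                        g=[4.0, -0.5], t=1)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by almost exactly the step size in the direction opposite to the sign of its gradient.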
53 The Adam Optimization Algorithm
55 Backpropagation
Rumelhart, D.E., Hinton, G.E. and Williams, R.J., Learning representations by back-propagating errors. Nature, 323(6088), p.533.
While this theoretical framework for neural network optimization may seem complete, one fundamental question still remains: How are the gradients of network parameters actually computed?
One approach, referred to as backpropagation, was proposed in 1986; it dealt with sigmoidal activations $\sigma$ and defined the loss $E$ in terms of predictions $y_j$ and true values $d_j$ via:
$$E = \frac{1}{2} \sum_j (y_j - d_j)^2$$
56 Backpropagation: Rumelhart, Hinton, and Williams
60 Backpropagation
Now that the error contribution associated with $y_i$ is known: i.e.
$$\frac{\partial E}{\partial y_i} = \sum_j \frac{\partial E}{\partial x_j} \, w_{ji}$$
contributions from network parameters of the previous layer can be computed using the same methodology that was applied to $y_j$: e.g.
$$\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_i} \cdot \frac{d y_i}{d x_i} = \frac{\partial E}{\partial y_i} \cdot \frac{d\, \sigma(x_i)}{d x_i}$$
In this way, gradient calculations for all network parameters can be computed by propagating back the error contributions from the parameters in subsequent layers which depend on their values.
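The propagation rule above can be sketched for a small network with two sigmoid layers, using the notation $x_j = \sum_i w_{ji} y_i$ and $y_j = \sigma(x_j)$. Bias terms are omitted for brevity, and the weights, inputs, and targets are hypothetical; the derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ is used for the sigmoid.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, W1, W2):
    """Two sigmoid layers: x_j = sum_i w_ji * y_i and y_j = sigma(x_j)."""
    hidden = [sigmoid(sum(w * y for w, y in zip(row, inputs))) for row in W1]
    outputs = [sigmoid(sum(w * y for w, y in zip(row, hidden))) for row in W2]
    return hidden, outputs

def loss(outputs, targets):
    return 0.5 * sum((y - d) ** 2 for y, d in zip(outputs, targets))

def backprop(inputs, targets, W1, W2):
    hidden, outputs = forward(inputs, W1, W2)
    # Output layer: dE/dy_j = y_j - d_j and dE/dx_j = dE/dy_j * y_j * (1 - y_j).
    dE_dx_out = [(y - d) * y * (1.0 - y) for y, d in zip(outputs, targets)]
    grad_W2 = [[dx * h for h in hidden] for dx in dE_dx_out]
    # Hidden layer: dE/dy_i = sum_j dE/dx_j * w_ji, then multiply by sigma'(x_i).
    dE_dy_hid = [sum(dE_dx_out[j] * W2[j][i] for j in range(len(W2)))
                 for i in range(len(hidden))]
    dE_dx_hid = [dy * h * (1.0 - h) for dy, h in zip(dE_dy_hid, hidden)]
    grad_W1 = [[dx * x for x in inputs] for dx in dE_dx_hid]
    return grad_W1, grad_W2

inputs, targets = [0.5, -1.0], [1.0]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.7, -0.5]]
grad_W1, grad_W2 = backprop(inputs, targets, W1, W2)
```

A standard sanity check is to compare individual entries of `grad_W1` and `grad_W2` against central finite differences of the loss.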
61 Symbolic and Numeric Differentiation
Two commonly used methods for automating the process of computing derivatives are symbolic differentiation and numeric differentiation; however, both of these methods have severe practical limitations in the context of training neural networks.
- Symbolic differentiation produces exact derivatives through direct manipulation of the mathematical expressions used to define functions; however, the resulting expressions can be lengthy and contain unnecessary computations, and are inefficient unless additional expression simplification steps are included
- Numeric differentiation techniques are widely applicable and efficient; however, the resulting inexact gradient estimates can entirely undermine the training process for large networks
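The inexactness of numeric differentiation can be seen with a forward difference applied to a function whose derivative is known. The choice of `math.sin`, the evaluation point, and the step sizes below are assumptions for illustration.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) by (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # derivative of sin at x = 1
errors = {h: abs(forward_difference(math.sin, 1.0, h) - exact)
          for h in (1e-1, 1e-4, 1e-8, 1e-14)}
```

Shrinking the step first reduces the truncation error, but for very small steps floating point round-off dominates and the estimate gets worse again, so no single step size yields an exact gradient.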
62 Automatic Differentiation
Automatic differentiation (AD) in reverse mode provides a generalization to backpropagation and gives us a way to carry out the required gradient calculations exactly and efficiently.
- Computes derivatives using the underlying computational graph
- Very efficient with respect to evaluation time
- A trace of all elementary operations is stored on an evaluation tape, or "Wengert list"; potentially large storage requirements
References:
- Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., Automatic differentiation in machine learning: a survey.
- University of Washington CSE599W: Spring 2018, Slides from Lecture 4: Backpropagation and Automatic Differentiation
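A toy version of reverse-mode AD with an explicit tape can be sketched in a few dozen lines. The class name `Var`, the supported operations (addition, multiplication, sigmoid), and the sample values are assumptions for illustration; this is not how any particular library implements it.

```python
import math

class Var:
    """Scalar node recorded on a shared tape (a Wengert list)."""
    def __init__(self, value, tape, parents=()):
        self.value = value
        self.parents = parents  # pairs (parent_node, local_derivative)
        self.grad = 0.0
        self.tape = tape
        tape.append(self)       # record the operation in evaluation order

    def __add__(self, other):
        return Var(self.value + other.value, self.tape,
                   ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, self.tape,
                   ((self, other.value), (other, self.value)))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Var(s, a.tape, ((a, s * (1.0 - s)),))

def backward(output):
    """Walk the tape in reverse, accumulating chain-rule contributions."""
    output.grad = 1.0
    for node in reversed(output.tape):
        for parent, local in node.parents:
            parent.grad += local * node.grad

tape = []
w, x, b = Var(0.5, tape), Var(2.0, tape), Var(-1.0, tape)
y = sigmoid(w * x + b)  # pre-activation is 0, so y = 0.5
backward(y)
```

One forward pass records every elementary operation, and a single reverse sweep then yields the derivatives of the output with respect to all inputs at once; the cost is that the whole tape must be kept in memory.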
More informationLecture 6 Optimization for Deep Neural Networks
Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017 Things we will look at today Stochastic Gradient Descent Things
More informationBased on the original slides of Hung-yi Lee
Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services