Understanding Neural Networks: Part I


1 TensorFlow Workshop 2018. Understanding Neural Networks, Part I: Artificial Neurons and Network Optimization. Nick Winovich, Department of Mathematics, Purdue University. July 2018.

2 Outline 1 Neural Networks Artificial Neurons and Hidden Layers Universal Approximation Theorem Regularization and Batch Norm 2 Network Optimization Evaluating Network Performance Stochastic Gradient Descent Algorithms Backprop and Automatic Differentiation

5 Artificial Neural Networks Neural networks are a class of simple, yet effective, computing systems with a diverse range of applications. In these systems, small computational units, or nodes, are arranged to form networks in which connectivity is leveraged to carry out complex calculations. Further reading: Deep Learning by Goodfellow, Bengio, and Courville; Convolutional Neural Networks for Visual Recognition at Stanford.

6 Artificial Neurons Diagram modified from a Stack Exchange post answered by Gonzalo Medina. Weights are first used to scale inputs; the results are summed with a bias term and passed through an activation function.

7 Formula and Vector Representation The diagram from the previous slide can be interpreted as: y = f(x_1 w_1 + x_2 w_2 + x_3 w_3 + b), which can be conveniently represented in vector form via: y = f(w^T x + b), by interpreting the neuron inputs and weights as column vectors.

8 Artificial Neurons: Multiple Outputs

9 Matrix Representation This corresponds to a pair of equations, one for each output: y_1 = f(w_1^T x + b_1) and y_2 = f(w_2^T x + b_2), which can be represented in matrix form by the system: y = f(Wx + b), where we assume the activation function has been vectorized (i.e. applied elementwise).
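
The matrix form above can be computed directly; as a minimal NumPy sketch (the weights, bias, input, and sigmoid activation below are arbitrary illustrative choices, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two outputs from three inputs: W has shape (2, 3), b has shape (2,)
W = np.array([[0.2, -0.5, 0.1],
              [0.7,  0.3, -0.4]])
b = np.array([0.1, -0.2])
x = np.array([1.0, 2.0, 3.0])

y = sigmoid(W @ x + b)   # y = f(Wx + b), with f applied elementwise
print(y)
```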

10 Fully-Connected Neural Layers The resulting layers, referred to as fully-connected or dense layers, can be visualized as a collection of nodes connected by edges corresponding to weights (biases/activations are typically omitted from the diagram).

11 Floating Point Operation Count: Matrix-Vector Multiplication Computing the product Wx of an M × N weight matrix with an N-dimensional input requires MN multiplications (Mult: MN), one for each product w_ij x_j, and M(N − 1) additions (Add: M(N − 1)) to form the row sums (Wx)_i = w_i1 x_1 + ... + w_iN x_N.

12 Floating Point Operation Count So we see that when bias terms are omitted, the FLOPs required for a neural connection between N inputs and M outputs is: 2MN − M = MN multiplies + M(N − 1) adds. When bias terms are included, an additional M addition operations are required, resulting in a total of 2MN FLOPs. Note: This omits the computation required for applying the activation function to the M values resulting from the linear operations. Depending on the activation function selected, this may or may not have a significant impact on the overall computational complexity.
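
As a small sketch, the counts above can be tallied directly (the function name is hypothetical, used only for illustration):

```python
def dense_layer_flops(n_inputs, m_outputs, include_bias=True):
    """FLOPs for the linear map of a dense layer with N inputs and M outputs."""
    mults = m_outputs * n_inputs            # one multiply per weight entry
    adds = m_outputs * (n_inputs - 1)       # summing N products for each output
    if include_bias:
        adds += m_outputs                   # one extra add per output for the bias
    return mults + adds

print(dense_layer_flops(1024, 256))                        # 2 * 256 * 1024 = 524288
print(dense_layer_flops(1024, 256, include_bias=False))    # 2 * 256 * 1024 - 256 = 524032
```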

13 Activation Functions Activation functions are a fundamental component of neural network architectures; these functions are responsible for: Providing all of the network's non-linear modeling capacity; Controlling the gradient flows that guide the training process. While activation functions play a fundamental role in all neural networks, it is still desirable to limit their computational demands (e.g. avoid defining them in terms of a Krylov subspace method...). In practice, activations such as rectified linear units (ReLUs), with the most trivial function and derivative definitions, often suffice.

14 Activation Functions Rectified Linear Unit (ReLU): f(x) = x for x ≥ 0, and f(x) = 0 for x < 0. SoftPlus Activation: f(x) = ln(1 + exp(x)).

15 Activation Functions Sigmoidal Unit: f(x) = 1 / (1 + exp(−x)). Hyperbolic Tangent Unit: f(x) = tanh(x).

16 Activation Functions (Parameterized) Exponential Linear Unit (ELU): f_α(x) = x for x ≥ 0, and f_α(x) = α(e^x − 1) for x < 0. Leaky Rectified Linear Unit: f_α(x) = x for x ≥ 0, and f_α(x) = αx for x < 0.

17 Activation Functions (Learnable Parameters) Parameterized ReLU: f_β(x) = x for x ≥ 0, and f_β(x) = βx for x < 0, with β learnable. Swish Units: f_β(x) = x / (1 + exp(−βx)).
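
For reference, a minimal NumPy sketch of several of these activations (the default parameter values below are common illustrative choices, not values prescribed by the slides):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x):
    return np.log1p(np.exp(x))              # ln(1 + e^x); naive, may overflow for large x

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))    # x * sigmoid(beta * x)

x = np.linspace(-3.0, 3.0, 7)
print(relu(x), elu(x), swish(x))
```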

18 Hidden Layers Intermediate, or hidden, layers can be added between the input and output nodes to allow for additional non-linear processing. For example, we can first define a layer such as: h = f_1(W_1 x + b_1) and construct a subsequent layer to produce the final output: y = f_2(W_2 h + b_2).

19 Hidden Layers

20 Multiple Hidden Layers

21 Multiple Hidden Layers Multiple hidden layers can easily be defined in the same way: h_1 = f_1(W_1 x + b_1), h_2 = f_2(W_2 h_1 + b_2), y = f_3(W_3 h_2 + b_3). One of the challenges of working with additional layers is the need to determine the impact that earlier layers have on the final output. This will be necessary for tuning/optimizing network parameters (i.e. weights and biases) to produce accurate predictions.
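
A minimal NumPy sketch of this three-layer forward pass, assuming ReLU hidden activations and an identity output activation (the layer sizes and random initialization are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative layer sizes: 4 inputs, two hidden layers of width 8, 2 outputs
n_in, n_h1, n_h2, n_out = 4, 8, 8, 2
W1, b1 = rng.normal(size=(n_h1, n_in)),  np.zeros(n_h1)
W2, b2 = rng.normal(size=(n_h2, n_h1)),  np.zeros(n_h2)
W3, b3 = rng.normal(size=(n_out, n_h2)), np.zeros(n_out)

def forward(x):
    h1 = relu(W1 @ x + b1)     # h1 = f1(W1 x + b1)
    h2 = relu(W2 @ h1 + b2)    # h2 = f2(W2 h1 + b2)
    return W3 @ h2 + b3        # y = f3(W3 h2 + b3), with f3 taken as the identity

print(forward(rng.normal(size=n_in)))
```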

22 Outline 1 Neural Networks Artificial Neurons and Hidden Layers Universal Approximation Theorem Regularization and Batch Norm 2 Network Optimization Evaluating Network Performance Stochastic Gradient Descent Algorithms Backprop and Automatic Differentiation

23 Universal Approximators: Cybenko (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4). Basic Idea of Result: Let I_n denote the unit hypercube in R^n; the collection of functions which can be expressed in the form: Σ_{i=1}^N α_i σ(w_i^T x + b_i), x ∈ I_n, is dense in the space of continuous functions C(I_n) defined on I_n: i.e. for every f ∈ C(I_n) and ε > 0, there exist constants N, α_i, w_i, b_i such that | f(x) − Σ_{i=1}^N α_i σ(w_i^T x + b_i) | < ε for all x ∈ I_n.

24 Universal Approximators: Hornik et al. / Funahashi Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5). Funahashi, K.I. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3). Summary of Results: For any compact set K ⊂ R^n, multi-layer feedforward neural networks are dense in the space of continuous functions C(K) on K, with respect to the supremum norm, provided that the activation function used for the network layers is: Continuous and increasing; Non-constant and bounded.

25 Universal Approximators: Leshno et al. (1992) Leshno, M., Lin, V.Y., Pinkus, A. and Schocken, S. Multilayer feedforward networks with a non-polynomial activation function can approximate any function. "A standard multilayer feedforward network with a locally bounded piecewise continuous activation function can approximate any continuous function to any degree of accuracy if and only if the network's activation function is not a polynomial." (Leshno et al.) Here the notion of approximation is also defined in terms of the supremum norm, and the domains are assumed to be compact. The result does not hold without thresholds (i.e. bias terms).

26 Outline 1 Neural Networks Artificial Neurons and Hidden Layers Universal Approximation Theorem Regularization and Batch Norm 2 Network Optimization Evaluating Network Performance Stochastic Gradient Descent Algorithms Backprop and Automatic Differentiation

27 Overfitting In some cases the network is capable of learning too much from the specific training data used; this phenomenon is referred to as overfitting and occurs when the model performs well on the training dataset but does not generalize to accurate predictions on data which has not been seen during training. Consider, for example:

28 L1 and L2 Weight Regularization One simple technique to help avoid overfitting is to add a penalty for network parameters with large L1 or L2 norms. This is similar to the underlying idea behind LASSO regression and can be loosely interpreted as a form of applying the principle of Ockham's Razor: i.e. the simplest solution often turns out to be the correct solution. L2 regularization is a fairly general regularization technique which places an emphasis on reducing the largest weights. L1 regularization helps to encourage sparsity in the network and improves performance when the problem has a sparse solution.
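
A minimal sketch of how such penalties are commonly added to the training loss (the variable names and scale factor lam are illustrative):

```python
import numpy as np

def l2_penalty(weight_matrices, lam=1e-4):
    """Penalty on the squared L2 norms of the weights; discourages large weights."""
    return lam * sum(np.sum(W ** 2) for W in weight_matrices)

def l1_penalty(weight_matrices, lam=1e-4):
    """Penalty on the L1 norms of the weights; encourages sparsity."""
    return lam * sum(np.sum(np.abs(W)) for W in weight_matrices)

# During training, one would then minimize e.g.
#   total_loss = data_loss + l2_penalty([W1, W2, W3])
W1, W2 = np.ones((4, 3)), np.ones((2, 4))
print(l2_penalty([W1, W2]), l1_penalty([W1, W2]))
```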

29 Applying Dropout Applying dropout to hidden network layers also helps to avoid overfitting. This technique consists of removing, or dropping, units/nodes randomly at each step of the training process. A fixed drop rate p ∈ (0, 1) is specified prior to training, and nodes in the layer are dropped according to a collection of i.i.d. random Bernoulli samples drawn at each training step. Since all nodes will be used after training, the outputs of the remaining nodes are rescaled by a factor of 1/(1 − p) to ensure that the expected values during training and testing coincide. Loosely speaking, this can be thought of as a way to ensure that no individual node plays too large of a role in the final prediction.
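
A minimal NumPy sketch of this procedure, using the rescaling by 1/(1 − p) during training (often called inverted dropout) so that no change is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.25, training=True):
    """Drop each unit with probability p and rescale the survivors by 1/(1 - p)."""
    if not training:
        return h                           # all units are active at test time
    mask = rng.random(h.shape) >= p        # i.i.d. keep/drop decisions, P(keep) = 1 - p
    return h * mask / (1.0 - p)            # rescale so expected values match test time

h = np.ones(8)
print(dropout(h, p=0.25))                  # roughly 3/4 of entries kept, scaled to 4/3
```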

30 Example: Dropout with Rate = 0.25 [ Training ]

31 Example: Dropout with Rate = 0.25 [ Training ]

32 Example: Dropout with Rate = 0.25 [ Training ]

33 Example: Dropout with Rate = 0.25 [ Testing ]

34 Motivation for Batch Normalization Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint. Internal Covariate Shift: As network parameters change during training, the distributions of the input values to each layer change. Training could be more efficient if the layers were receiving inputs with a fixed distribution throughout the entire process. Achieving this using normalization requires a technique which is compatible with gradient-based optimization.

35 Batch Normalization The proposed batch normalization technique corresponds to first performing a normalization with respect to the batch statistics: x̂ = (x − µ_B) / √(σ_B² + ε), with µ_B = (1/m) Σ_{x∈B} x and σ_B² = (1/m) Σ_{x∈B} (x − µ_B)², where m is a fixed batch size, and ε > 0 is included for numerical stability. A linear map with learnable parameters γ and β is then applied: y_i = γ x̂_i + β, and the normalized values {y_i} are passed to the activation function to apply the non-linear transformation for the layer.
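
A minimal NumPy sketch of this training-time normalization for a batch of feature vectors (the shapes and the value of ε are illustrative):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a batch x of shape (m, features), then apply the learnable affine map."""
    mu = x.mean(axis=0)                        # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalized activations
    return gamma * x_hat + beta                # y = gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(32, 4))
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))           # approximately 0 and 1 per feature
```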

36 Batch Normalization after Training After training, we need a way to freeze the model in place for making predictions. This is accomplished by specifying a fixed normalization rule for each layer; rather than use sample statistics from a specific batch, it is natural to incorporate the entire dataset: x̂ = (x − µ) / √(σ² + ε), where µ is the empirical mean E_D[x] and σ² is the variance Var_D[x] taken with respect to the complete training dataset D. These values can be tracked using moving averages during training to avoid direct computation and provide accurate estimates when parameter changes are small near the end of the training process.

37 Outline 1 Neural Networks Artificial Neurons and Hidden Layers Universal Approximation Theorem Regularization and Batch Norm 2 Network Optimization Evaluating Network Performance Stochastic Gradient Descent Algorithms Backprop and Automatic Differentiation

39 Evaluating a Network Up until now, we have not yet discussed how to quantify the performance of a neural network. The most common strategy is to define loss functions for the network, with low loss corresponding to high performance. In supervised training, where the true labels/solutions are known, the network loss function is typically composed of: A primary loss term corresponding to a measure of how close the network predictions are to the true solutions; Auxiliary loss components, such as weight regularization penalties, designed to help guide the training process. Once the network performance is quantified, we can specify an optimization algorithm designed to minimize the network loss.

40 Loss Functions Two of the most common applications of neural networks are regression (for predicting continuous properties/values) and classification (for predicting discrete properties or labels). For regression, a standard loss is given by the mean squared error: Loss = (1/|I|) Σ_{i∈I} (ŷ_i − y_i)², where I is the index set of the output data (e.g. pixels of an image). For classification, softmax cross entropy can be used when labels are mutually exclusive (e.g. classifying a digit as 0, 1, 2, etc.); sigmoid cross entropy can be used when labels are not mutually exclusive (e.g. determining which objects are in an image).
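
A minimal NumPy sketch of the mean squared error and of softmax cross entropy with one-hot labels (a plain, numerically shifted implementation; in practice a framework routine would be used):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error over all output entries."""
    return np.mean((y_pred - y_true) ** 2)

def softmax_cross_entropy(logits, labels):
    """Cross entropy for mutually exclusive classes; labels are one-hot rows."""
    z = logits - logits.max(axis=1, keepdims=True)                 # shift for stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log softmax
    return -np.mean(np.sum(labels * log_probs, axis=1))

print(mse_loss(np.array([1.0, 2.0]), np.array([1.5, 2.0])))
print(softmax_cross_entropy(np.array([[2.0, 0.5, -1.0]]),
                            np.array([[1.0, 0.0, 0.0]])))
```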

41 One Hot Encoding It is also fundamentally important to consider how the data will be represented within the network. When classifying digits, for example, networks will typically perform extremely poorly if the labels are represented as a single number: 0.0, 1.0, 2.0, etc. To better distinguish the differences between e.g. 0, 1, and 2, it is useful to instead store the values using a one hot encoding: 0 = [1, 0, 0, ...], 1 = [0, 1, 0, ...], 2 = [0, 0, 1, ...]. One hot encodings are also typically used for word prediction (by specifying a dictionary of possibilities) and character level predictions (by specifying the admissible character set).
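
A minimal sketch of building such an encoding with NumPy:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer class labels to one hot row vectors."""
    return np.eye(num_classes)[labels]

print(one_hot(np.array([0, 1, 2]), num_classes=10))
```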

42 Data Preparation In general, it is also good practice to first process/prepare the input values of a dataset before training. For example, if the input values are centered around 100 and all lie within the interval [99.9, 100.1], it is typically better to center and rescale these values beforehand: x̂ = (x − µ) / √(σ² + ε), where µ = (1/|D|) Σ_{x∈D} x and σ² = (1/|D|) Σ_{x∈D} (x − µ)². The values of µ and σ² can then be saved, and predictions can be made on arbitrary inputs during testing by applying the above normalization before passing the test inputs to the network.
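
A minimal sketch of saving the training statistics and reusing them at test time (the synthetic data mimics the inputs described above):

```python
import numpy as np

rng = np.random.default_rng(0)
data = 100.0 + 0.1 * rng.uniform(-1.0, 1.0, size=(1000, 3))   # inputs near 100

mu = data.mean(axis=0)          # saved after preprocessing the training set
sigma2 = data.var(axis=0)

def standardize(x, eps=1e-8):
    """Apply the saved training-set statistics to any (training or test) input."""
    return (x - mu) / np.sqrt(sigma2 + eps)

print(standardize(data).std(axis=0))                  # approximately 1 per feature
print(standardize(np.array([99.95, 100.02, 100.1])))  # a new test input
```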

43 Outline 1 Neural Networks Artificial Neurons and Hidden Layers Universal Approximation Theorem Regularization and Batch Norm 2 Network Optimization Evaluating Network Performance Stochastic Gradient Descent Algorithms Backprop and Automatic Differentiation

44 Gradient Descent Gradient descent provides a simple, iterative algorithm for finding local minima of a real-valued function F numerically. The main idea behind gradient descent is relatively straightforward: compute the gradient of the function that we want to minimize and take a step in the direction of steepest descent (i.e. −∇F(x)). The iteration step of the algorithm is defined in terms of a step size parameter α (or a decreasing sequence {α_i}) by setting: x_{i+1} = x_i − α ∇F(x_i). Note: Convergence is only guaranteed under certain assumptions on the function F (e.g. convexity, Lipschitz continuity, etc.).
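
A minimal sketch of this iteration on a simple quadratic (the step size and iteration count are arbitrary):

```python
import numpy as np

def gradient_descent(grad_F, x0, alpha=0.1, num_steps=100):
    """Plain gradient descent: x_{i+1} = x_i - alpha * grad_F(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - alpha * grad_F(x)
    return x

# Minimize F(x) = ||x - 3||^2, whose gradient is 2(x - 3); the minimum is at x = 3
print(gradient_descent(lambda x: 2.0 * (x - 3.0), x0=[0.0, 10.0]))
```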

45 Loss Functions for Large Datasets In the context of neural networks, gradient descent appears to provide a reasonable approach for tuning network parameters. The initial weights and biases can be interpreted as a single vector θ_0, and the iteration steps from the previous slide could, in theory, be used to identify the optimal parameters θ for the model. The issue with this approach is that the function we are actually trying to minimize is defined in terms of the entire dataset D: F(θ) = (1/|D|) Σ_{x∈D} f(x; θ), where f(x; θ) denotes the loss for a single example x when using the model parameters θ. So the standard algorithm would require computing the average loss at each step of the iterative scheme...

46 Stochastic Gradient Descent Since computing the true gradient ∇F(θ) at every step is impractical for large datasets, we can instead try to approximate this gradient using a smaller, more manageable mini-batch of data: ∇F(θ) ≈ ∇F_i(θ) = (1/|B_i|) Σ_{x∈B_i} ∇f(x; θ), where the batches {B_i} partition the dataset into smaller subsets (typically of equal size). The iteration step is then taken to be: θ_{i+1} = θ_i − α ∇F_i(θ_i).
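
A minimal sketch of mini-batch SGD on a least-squares problem (the synthetic data, batch size, and step size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: y = X w_true + noise
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(3)                          # parameters theta
alpha, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))        # shuffle, then partition into mini-batches
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient of the batch MSE
        w = w - alpha * grad                               # theta_{i+1} = theta_i - alpha * grad

print(w)   # close to w_true
```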

47 Potential Obstacles Fixed learning rates typically lead to suboptimal performance Defining a learning rate schedule manually does not allow the algorithm to adapt to the particular problem in consideration Different parameters often require different learning rates Since the directions/magnitudes of previous updates are not taken into consideration, defining optimization policies on small batches of data may lead to a noisy, inefficient training process

48 Importance of Selecting the Correct Learning Rate [figures: high learning rate vs. low learning rate]

49 Nesterov Momentum Nesterov, Y.E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR (Vol. 269). One method for learning from previous steps is to incorporate momentum into the update policy. This can be done by setting: v_i = γ v_{i−1} + α ∇F(θ), θ_{i+1} = θ_i − v_i. An accelerated form was introduced by Nesterov in 1983 which leverages the value of looking ahead before making updates: v_i = γ v_{i−1} + α ∇F(θ − γ v_{i−1}).
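
A minimal sketch of the two update rules on a one-dimensional quadratic (the step size and decay values are arbitrary):

```python
import numpy as np

def momentum_step(theta, v, grad_F, alpha=0.01, gamma=0.9, nesterov=False):
    """One momentum update; with nesterov=True the gradient is evaluated at the look-ahead point."""
    if nesterov:
        v = gamma * v + alpha * grad_F(theta - gamma * v)
    else:
        v = gamma * v + alpha * grad_F(theta)
    return theta - v, v

# Minimize F(theta) = theta^2, whose gradient is 2 * theta
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = momentum_step(theta, v, lambda t: 2.0 * t, nesterov=True)
print(theta)   # approaches the minimizer at 0
```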

50 AdaGrad and RMSProp Duchi, J., Hazan, E. and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul). Hinton, G., Srivastava, N. and Swersky, K. Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent. AdaGrad defines parameter-specific updates which are normalized by the sum of the squares of previous gradients; this leads to a natural learning rate decay (often too much). RMSProp keeps moving averages of the squared gradients for each parameter which are used to rescale updates. AdaDelta provides another method for rescaling updates, and many other variants (e.g. including momentum) exist...
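
A minimal sketch of the RMSProp update described above (the decay rate and step size are typical illustrative values):

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=0.01, rho=0.9, eps=1e-8):
    """Keep a moving average of squared gradients and use it to rescale each update."""
    s = rho * s + (1.0 - rho) * grad ** 2          # per-parameter squared-gradient average
    theta = theta - alpha * grad / (np.sqrt(s) + eps)
    return theta, s

theta, s = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(1500):
    grad = 2.0 * theta                             # gradient of ||theta||^2
    theta, s = rmsprop_step(theta, s, grad)
print(theta)   # near 0, up to small oscillations on the order of alpha
```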

51 Exponential Moving Averages One common method for estimating an average incrementally is to keep an exponential moving average of the values. This method applies an exponential decay to terms in the average which places an emphasis on the most recent values; this allows the average to move, or correct itself, as the distribution of the values changes. To track the gradient g_t = ∇F(θ_{t−1}) of the loss with respect to the parameters θ, we can define an average recursively by setting: m_0 = 0, m_t = β m_{t−1} + (1 − β) g_t, which unrolls to m_t = (1 − β) Σ_{τ=1}^t β^{t−τ} g_τ, where the parameter β is used to specify the exponential decay rate and is typically taken to be close to, but smaller than, 1.

52 The Adam Optimization Algorithm Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint. The Adam optimizer, derived from adaptive moment estimation, proposes keeping exponential moving averages of both the first moment g_t and the (uncentered) second moment g_t² of the gradient. In addition, a bias correction is introduced to address the issue of arbitrarily initializing the exponential moving averages with zero. The first moment average m_t and second moment average v_t are defined with decay rates β_1 and β_2, respectively, and the bias correction procedure is defined by the rescaling: m̂_t = m_t / (1 − β_1^t), v̂_t = v_t / (1 − β_2^t).
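
A minimal sketch of the resulting update rule, assuming the commonly cited default values α = 0.001, β_1 = 0.9, β_2 = 0.999 (a plain re-implementation for illustration, not the reference code):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad           # EMA of the first moment
    v = beta2 * v + (1.0 - beta2) * grad ** 2      # EMA of the (uncentered) second moment
    m_hat = m / (1.0 - beta1 ** t)                 # bias-corrected estimates
    v_hat = v / (1.0 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for t in range(1, 10001):
    grad = 2.0 * theta                             # gradient of ||theta||^2
    theta, m, v = adam_step(theta, m, v, grad, t)
print(theta)   # approaches the minimizer at 0
```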

53 The Adam Optimization Algorithm

54 Outline 1 Neural Networks Artificial Neurons and Hidden Layers Universal Approximation Theorem Regularization and Batch Norm 2 Network Optimization Evaluating Network Performance Stochastic Gradient Descent Algorithms Backprop and Automatic Differentiation

55 Backpropagation Rumelhart, D.E., Hinton, G.E. and Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), p. 533. While this theoretical framework for neural network optimization may seem complete, one fundamental question still remains: How are the gradients of network parameters actually computed? One approach, referred to as backpropagation, was proposed in 1986 which dealt with sigmoidal activations σ and defined the loss E in terms of predictions y_j and true values d_j via: E = (1/2) Σ_j (y_j − d_j)².

56 Backpropagation: Rumelhart, Hinton, and Williams

57 Backpropagation: Rumelhart, Hinton, and Williams

58 Backpropagation: Rumelhart, Hinton, and Williams

59 Backpropagation: Rumelhart, Hinton, and Williams

60 Backpropagation Now that the error contribution associated with y_i is known, i.e. ∂E/∂y_i = Σ_j (∂E/∂x_j) w_ji, contributions from network parameters of the previous layer can be computed using the same methodology that was applied to y_j, e.g. ∂E/∂x_i = (∂E/∂y_i)(dy_i/dx_i) = (∂E/∂y_i) dσ(x_i)/dx_i. In this way, gradient calculations for all network parameters can be computed by propagating back the error contributions from the parameters in subsequent layers which depend on their values.
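
A minimal NumPy sketch of these rules for a network with one hidden layer, sigmoid activations, and the squared error E defined above (the sizes and random weights are arbitrary; x_l denotes pre-activations and y_l unit outputs as in the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                  # network input
d = np.array([1.0, 0.0])                # target values d_j
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# Forward pass
x1 = W1 @ x + b1;  y1 = sigmoid(x1)
x2 = W2 @ y1 + b2; y2 = sigmoid(x2)
E = 0.5 * np.sum((y2 - d) ** 2)

# Backward pass: propagate error contributions layer by layer
dE_dy2 = y2 - d                                   # from E = 1/2 * sum_j (y_j - d_j)^2
dE_dx2 = dE_dy2 * y2 * (1.0 - y2)                 # sigma'(x) = sigma(x)(1 - sigma(x))
dE_dW2 = np.outer(dE_dx2, y1); dE_db2 = dE_dx2
dE_dy1 = W2.T @ dE_dx2                            # dE/dy_i = sum_j (dE/dx_j) w_ji
dE_dx1 = dE_dy1 * y1 * (1.0 - y1)
dE_dW1 = np.outer(dE_dx1, x);  dE_db1 = dE_dx1

print(E, dE_dW2, dE_dW1)
```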

61 Symbolic and Numeric Differentiation Two commonly used methods for automating the process of computing derivatives are symbolic differentiation and numeric differentiation; however, both of these methods have severe practical limitations in the context of training neural networks. Symbolic differentiation produces exact derivatives through direct manipulation of the mathematical expressions used to define functions; the resulting expressions can be lengthy and contain unnecessary computations, however, and are inefficient unless additional expression simplification steps are included Numeric differentiation techniques are widely applicable and efficient; however, the resulting inexact gradient estimates can entirely undermine the training process for large networks

62 Automatic Differentiation Automatic differentiation (AD) in reverse mode provides a generalization of backpropagation and gives us a way to carry out the required gradient calculations exactly and efficiently. Computes derivatives using the underlying computational graph. Very efficient with respect to evaluation time. A trace of all elementary operations is stored on an evaluation tape, or "Wengert list"; potentially large storage requirements. Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M. Automatic differentiation in machine learning: a survey. University of Washington CSE599W (Spring 2018), slides from Lecture 4: Backpropagation and Automatic Differentiation.
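
As a minimal sketch of reverse-mode AD in practice, assuming TensorFlow 2.x with eager execution (the variables and the toy computation are illustrative):

```python
import tensorflow as tf

# Reverse-mode AD on a tiny computation: y = sum(sigmoid(W x + b))
W = tf.Variable([[0.2, -0.5], [0.7, 0.3]])
b = tf.Variable([0.1, -0.2])
x = tf.constant([1.0, 2.0])

with tf.GradientTape() as tape:                 # records the elementary operations
    y = tf.reduce_sum(tf.sigmoid(tf.linalg.matvec(W, x) + b))

dW, db = tape.gradient(y, [W, b])               # exact gradients via the reverse sweep
print(dW.numpy(), db.numpy())
```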
