arxiv: v1 [cs.lg] 11 May 2015

Size: px
Start display at page:

Download "arxiv: v1 [cs.lg] 11 May 2015"

Transcription

1 Improving neural networks with bunches of neurons modeled by Kumaraswamy units: Preliminary study Jakub M. Tomczak Wrocław University of Technology, wybrzeże Wyspiańskiego 7, 5-37, Wrocław, Poland arxiv: v1 [cs.lg] 11 May 15 Abstract Deep neural networks have recently achieved state-of-the-art results in many machine learning problems, e.g., speech recognition or object recognition. Hitherto, work on rectified linear units (ReLU) provides empirical and theoretical evidence on performance increase of neural networks comparing to typically used sigmoid activation function. In this paper, we investigate a new manner of improving neural networks by introducing a bunch of copies of the same neuron modeled by the generalized Kumaraswamy distribution. As a result, we propose novel nonlinear activation function which we refer to as Kumaraswamy unit which is closely related to ReLU. In the experimental study with MNIST image corpora we evaluate the Kumaraswamy unit applied to single-layer (shallow) neural network and report a significant drop in test classification error and test cross-entropy in comparison to sigmoid unit, ReLU and Noisy ReLU. 1. Introduction Deep neural networks are quickly becoming a crucial element of high performance systems in many domains (Bengio et al., 13), e.g., speech recognition, object recognition, natural language processing, multi-task and domain adaptation. Typical neural networks are based on sigmoid hidden units (Bengio, 9), however, they can suffer from the vanishing gradient problem (Bengio et al., 199). The issue may arise when lower layers of a neural network have gradients nearly because higher layers are mostly saturated at or 1. The vanishing gradients may drastically slow down the optimization procedure and eventually may lead to a poor local minimum. In order to overcome issues associated with the sigmoid Copyright 15 by the author. non-linearity it is advocated to utilize other types of hidden units. Recently, deep neural networks with rectified linear units (ReLU) have seen success in different applications, e.g., signal processing (Zeiler et al., 13), sentiment analysis (Glorot et al., 11), object recognition (Jarrett et al., 9), image analysis (Nair & Hinton, 1). It has been shown that piece-wise linear units, such as ReLU, can compute highly complex and structured functions (Montufar et al., 1). The practical success and theoretical results on ReLU have indicated a new direction for research. In (Maas et al., 13) a leaky version of ReLU was proposed. The empirical evaluation on speech recognition task has shown slight improvement in comparison to sigmoid unit and ReLU. Further investigations with parametrized leaky ReLU (called Parametric ReLU) in (He et al., 15) confirmed the presumption that simple ReLU may be to restrictive to learn fully successful representation. Recently, Agostinelli et al. (15) went even further and proposed Adaptive Piecewise Linear Units (APLU), ReLU with a piecewise linear part for negative values. In the experiments it was shown that APLU can lead to significant performance increase. In this work, we propose to improve neural networks by modeling neurons in a new manner. Our idea is to replicate a neuron with the same weights and biases in order to increase the robustness of learning a pattern. Moreover, we take into account the complex structure of single neuron which is represented by an additional parameter. A suitable fashion of modeling such bunch of neurons is application of a generalized Kumaraswamy distribution (KUM-G) (Cordeiro & de Castro, 11; Nadarajah et al., 1). The Kumaraswamy distribution (KUM) can be seen as an alternative distribution to the Beta distribution (Jones, 9) and KUM-G determines a new class of distribution for given base probability measure. In our case, assuming single neuron in the bunch of neurons is modeled by a sigmoid function, we obtain novel non-linear activation function which we refer to as Kumaraswamy unit. For properly chosen parameters of KUM-G, the Kumaraswamy unit behaves similarly to ReLU. The contribution of the paper is the following: i) we

2 introduce an original non-linear activation function (Kumaraswamy unit) which follows from modeling a bunch of copies of the same neuron using the generalized Kumaraswamy distribution, ii) we provide close relationship between the Kumaraswamy unit and ReLU, iii) we provide an empirical evaluation of a single-layer neural network with the proposed hidden unit applied to MNIST dataset.. Modeling bunch of neurons: Kumaraswamy unit Sigmoid ReLU Noisy ReLU Kumaraswamy(5,6) Kumaraswamy(8,3) Preliminaries Let us focus on conventional feed-forward neural network with an input (visibles) v [, 1] D 1 and an output y {, 1} K 1 such that k y k = 1. In general, the network consists of L hidden layers, however, we restrict our considerations to one hidden layer for clarity. Therefore, the parameters of the network are θ = {c, d, W, U}, where c R M 1 and d R K 1 denote hidden and output biases, respectively, and W R M D are input-to-hidden weights, and U R K M are hiddento-output weights. The output of the network is modeled by the softmax unit: p(y k = 1 v, θ) = exp(u k f(v; c, W) + d k ) l exp(u l f(v; c, W) + d l ) where U i denotes i-th row of the matrix U, and f(v; c, W) is the M-dimensional output of the hidden layer. The activity of hidden units is modeled by some elementwise non-linear function f( ). Typical activation function is the sigmoid function: σ(x) = (1) exp( x). () Recently, several alternatives to the sigmoid function have been used in numerous applications, such as, rectified linear unit (ReLU) (Jarrett et al., 9): r(x) = max{, x}, (3) or noisy rectified linear unit (Noisy ReLU) (Nair & Hinton, 1): n(x) = max{, x + N (, v)}, () where N (, ) is a normal probability density function with zero mean and variance v. Bunch of neurons Let us assume that activation of a neuron is modeled by a sigmoid unit. Moreover, let us presume that each neuron consists of a independent elements and there are b independent copies of the same neuron. Therefore, instead of single neuron we consider a bunch of neurons that try to reflect one pattern. A similar idea with replicas of a neuron was introduced in (Teh & Hinton, 1) Figure 1. A comparison of chosen non-linear activation functions used in a hidden layer of a neural network. The green curve corresponds to typically used sigmoid function. The red curve represents ReLU non-linearity. The black curve shows the expected value of Noisy ReLU with v = 1. The magenta and blue curves depict the Kumaraswamy units with different values of scale parameters, (5, 6) and (8, 3), respectively. where the replication of sigmoid hidden units with the same weights and biases led to binomial units. Further, it turned out that binomial units with fixed offset to biases resulted in softplus units and its fast approximation, i.e., Noisy ReLU (Nair & Hinton, 1). However, in our approach we copy the sigmoid hidden unit b times and additionally we introduce a second parameter, a, which corresponds to modeling complexity of the neuron itself. Increasing the value of b results in higher robustness of the bunch of neurons because it is less probable that the input signal will not activate any hidden unit in the bunch. On the other hand, increasing the value of a leads to higher failure probability of the neuron activation because it suffices that at least one element fails to deactivate the whole neuron. It turns out that a suitable manner of modeling a bunch of neurons as described above is a generalized Kumaraswamy distribution (Cordeiro & de Castro, 11; Nadarajah et al., 1). The generalized Kumaraswamy distribution (KUM- G) for given base distribution G(x) with a probability density function g(x) is defined as follows: K G (x a, b) = 1 (1 G(x) a ) b, (5) where a > and b > are shape parameters. The probability density function of K G (x a, b) has a simple form: k G (x a, b) = a b g(x)g(x) a 1 (1 G(x) a ) b 1. (6)

3 For integer-valued shape parameters a and b KUM-G has a nice interpretation of a system which consists of b independent components and each component is made up of a independent subcomponents. KUM-G perfectly fits to modeling chosen property of a complex system, such as, lifetime of an entire system (Nadarajah et al., 1). We can clearly apply KUM-G to represent the bunch of neurons. Kumaraswamy unit Assuming that single neuron activates according to the sigmoid function, we can take advantage of KUM-G to model the bunch of neurons which yields a new kind of non-linear activation function: K σ (x a, b) = 1 (1 σ(x) a ) b. (7) We refer this resulting unit to as Kumaraswamy unit (Kumaraswamy(a, b)). Obviously, for a = b = 1 one recovers the sigmoid activation function. In the context of gradient-based learning algorithm it is important to compute a derivative of a hidden unit. Since the derivative of the sigmoid function can be easily calculated, the derivative of the Kumaraswamy unit can be obtained immediately (see Equation 6). An intriguing property of the Kumaraswamy unit is that for properly chosen values of a and b it can behave like ReLU (see Figure 1 for comparison of sigmoid unit, ReLU, Noisy ReLu and the Kumaraswamy unit). We consider only two pairs of values of shape parameters, namely, (a, b) {(5, 6), (8, 3)}. The Kumaraswamy unit with a = 5 and b = 6 is the closest 1 approximation of ReLU for value.5, while the second pair of values gives the closest approximation of ReLU in points.5 and.75. As we can evidently notice in Figure 1, the Kumaraswamy unit can behave similarly to ReLU but it returns values between and 1 like the sigmoid function. We argue that such behavior may be crucial in training a neural network and is more biologically plausible. Training Learning the parameters of the network θ for given data D = {(v n, t n )} N n=1 is performed by minimizing the cross-entropy loss. In the case of the neural network with output given by the softmax unit, the cross-entropy loss is equivalent to the negative conditional log-likelihood function: l(θ) = t kn log p(y kn v n, θ). (8) n 3. Experiments k Goal In the experiment we aim at answering the following question: 1 In the Euclidean sense. Is the Kumaraswamy unit preferable to sigmoid unit, ReLU or Noisy ReLU in training a single-layer neural network using stochastic gradient descent? We want to point out that we try to verify only the impact of the proposed non-linearity on learning the neural network. We believe that positive answer to the stated question will give us a good starting point for further experiments with multi-layered neural networks, unsupervised pre-training and more sophisticated training techniques. Data In the experiment we deal with the well-known MNIST dataset for hand-written digit classification. We split the dataset to 5, training images, 1, objects for validation, and 1, test examples. Each image consists of 8 8 pixels (D = 78), and is labeled as one of ten possible digits (K = 1). Learning methodology In the experiment we focus on a single-layer neural network with 5 hidden units (M = 5). We train the model using stochastic gradient descent with momentum and a mini-batch size of 1 examples. The initial value of the momentum term is set according to the model selection with possible values in {,.5}, and after 5 epochs it is set to.9. The learning rate is determined in the model selection procedure with possible values in {.1,.1,.1}. During learning process, we apply the learning step policy in which the learning step is divided by after each 1 epochs. Additionally, we add a weight decay to the learning objective and we explore the following values of the regularization coefficient: {, 1 5, 1 }. The maximum number of epochs is set to 1. The number of iterations over the training set is determined using early stopping according to the validation classification error, with a look ahead of 1 epochs. For all activation units the same initialization of the parameters is applied. Evaluation methodology In order to verify whether the Kumaraswamy unit is preferable to the sigmoid unit, ReLU or Noisy ReLU, we use two evaluation metrics, namely, the test classification error (Error) and the test crossentropy loss (Cross-Entropy). 3 Additionally, we measure the mean activity of hidden units. Results The results for the considered activation nonlinear activation functions are gathered in Table 1. A single run of a training procedure for the considered units is presented in Figure. Moreover, in order to get better insight into the trained representation by the considered nonlinearities input-to-hidden weights are depicted in Figure In the experiment we report the test cross-entropy loss divided by the number of test examples.

4 and the mean activities of hidden units are demonstrated in Figure. Table 1. Test classification error and test cross-entropy loss for single-layer neural network with different hidden units on MNIST averaged over 5 experiment runs. The best results are in bold. Hidden unit Error [%] Cross-Entropy sigmoid ReLU Noisy ReLU Kumaraswamy(5,6) Kumaraswamy(8,3) Discussion According to the results (see Table 1) we can conclude that the Kumaraswamy unit indeed tends to be preferable in comparison to sigmoid unit, ReLU and Noisy ReLU. The Kumaraswamy unit obtained the best results in terms of the test classification error and the test cross-entropy loss. The comparison of the two considered possible values of the scale parameters reveals that the Kumaraswamy(8, 3) performs better than the Kumaraswamy(5, 6), i.e., the bunches of 3 neurons with 8 elements seem to be more suitable in training a neural network. It is a well-known fact that application of ReLU can cause saturation of some hidden units. However, we notice that the saturation of several hidden units resulted in faster convergence of ReLU and Noisy ReLU in comparison to other activation non-linearities (see Figure ). For instance, ReLU obtained the minimum after 18 epochs while the sigmoid neural network converged after 65 iterations. It is worth noticing that the Kumaraswamy unit obtained the the best result (better by about 3% in comparison to others) just after 1 epochs. In Figure 3 the learned features (input-to-hidden weights) for the considered types of hidden units are depicted. The features learned by the sigmoid neural network represent different patterns but some of them seem to be redundant, e.g., there are many patterns which look like diagonal stroke. On the contrary, ReLU and Noisy ReLU allow to obtain more diverse features. However, there are many neurons which are saturated, but in the case of Noisy ReLU this effect is less evident. Slightly different features in shape are obtained by Kumaraswamy units. These are more similar to the ones learned by ReLU, nevertheless, they are not saturated or degenerated as in the case of sigmoid units. Next, we investigate the impact of different non-linearities on activity of hidden units. The histograms of mean activity of neurons measured on the test set given in Figure indicate that on average the ReLU, Noisy ReLU and Kumaraswamy units lead to larger number of neurons with activation smaller than.1 in comparison to the sigmoid units, namely, ReLU: 15, Noisy ReLU: 19, Kumaraswamy(5, 6): 7, Kumaraswamy(8, 3): 8, sigmoid:. However, Kumaraswamy unit has two advantages over ReLU and Noisy ReLU. First, there are less saturated neurons. Second, all Kumaraswamy units have mean activity less than.5 while in the case of ReLU and Noisy ReLU there are some units with very strong activation, i.e., above.5 or even larger than 1. Classification error Sigmoid ReLU Noisy ReLU Kumaraswamy(5,6) Kumaraswamy(8,3) Number of epochs Figure. Validation classification error against the number of epochs during learning process for the considered types of hidden units.. Conclusion In this work we introduced a new idea of improving neural networks with bunch of neurons, i.e., replicas of the same neurons which consist of independent elements. The bunch of neurons can be easily modeled by the generalized Kumaraswamy distribution which resulted in a formulation of new non-linear activation function we refer to as Kumaraswamy unit. A nice property of the Kumaraswamy unit is that for properly chosen parameters it can be approximately shaped as ReLU and it returns values between and 1. In the experiment the performance of the neural network with the Kumaraswamy unit was compared with other activation functions, namely, sigmoid unit, ReLU and Noisy ReLU. The obtained results on MNIST seem to confirm the supremacy of the Kumaraswamy unit, nonetheless, this statement needs to be confirmed with more thorough studies. We believe that the performed experiment gives a good starting point for further research on the Kumaraswamy units applied to deep neural networks.

5 (a) Sigmoid (b) ReLU (c) Noisy ReLU (d) Kumaraswamy(5,6) (e) Kumaraswamy(8,3) Figure 3. A visualization of first 1 of the input-to-hidden weights (features). It is apparent that the patterns learned by the neural network with the considered in the paper units differ in shapes. As expected, application of ReLU results in saturation of several hidden units. The same outcome, but less evident, can be observed in the case of the Noisy ReLU. The Kumaraswamy unit do not lead to saturated or degenerated hidden units.

6 (a) Sigmoid (b) ReLU (c) Noisy ReLU (d) Kumaraswamy(5,6) (e) Kumaraswamy(8,3) Figure. of hidden units calculated on the test set. The sigmoid neural network has all hidden units with non-zero activity and some neurons are very active (above.5). Application of ReLU results in many saturated hidden units while the Noisy ReLU alleviates this effect. Nonetheless, for ReLU and Noisy ReLU there are some hidden units with very strong response (i.e. above 1). The utilization of Kumaraswamy units leads to a representation in which all neurons have activity less than.5 on average.

7 Acknowledgments The research conducted by the author has been partially cofinanced by the Ministry of Science and Higher Education, Republic of Poland, grant No. B/I3. References Agostinelli, Forest, Hoffman, Matthew, Sadowski, Peter, and Baldi, Pierre. Learning activation functions to improve deep neural networks. arxiv preprint arxiv:11.683, 15. Bengio, Yoshua. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, (1):1 17, January 9. ISSN Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5 (): , 199. Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8): , 13. Montufar, Guido F, Pascanu, Razvan, Cho, Kyunghyun, and Bengio, Yoshua. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pp. 9 93, 1. Nadarajah, Saralees, Cordeiro, Gauss M, and Ortega, Edwin MM. General results for the kumaraswamy-g distribution. Journal of Statistical Computation and Simulation, 8(7): , 1. Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, pp , 1. Teh, Yee Whye and Hinton, Geoffrey E. Rate-coded restricted boltzmann machines for face recognition. Advances in neural information processing systems, pp , 1. Zeiler, Matthew D, Ranzato, M, Monga, Rajat, Mao, M, Yang, K, Le, Quoc Viet, Nguyen, Patrick, Senior, A, Vanhoucke, Vincent, Dean, Jeffrey, and Hinton, Geoffrey E. On rectified linear units for speech processing. In ICASSP, pp , 13. Cordeiro, Gauss M and de Castro, Mário. A new family of generalized distributions. Journal of Statistical Computation and Simulation, 81(7): , 11. Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier networks. In Proceedings of the 1th International Conference on Artificial Intelligence and Statistics. JMLR W&CP Volume, volume 15, pp , 11. He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. arxiv preprint arxiv:15.185, 15. Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, M, and Le- Cun, Yann. What is the best multi-stage architecture for object recognition? In Computer Vision, 9 IEEE 1th International Conference on, pp IEEE, 9. Jones, MC. Kumaraswamys distribution: A beta-type distribution with some tractability advantages. Statistical Methodology, 6(1):7 81, 9. Maas, Andrew L, Hannun, Awni Y, and Ng, Andrew Y. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, volume 3, 13.

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee New Activation Function Rectified Linear Unit (ReLU) σ z a a = z Reason: 1. Fast to compute 2. Biological reason a = 0 [Xavier Glorot, AISTATS 11] [Andrew L.

More information

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16 Slides adapted from Jordan Boyd-Graber, Justin Johnson, Andrej Karpathy, Chris Ketelsen, Fei-Fei Li, Mike Mozer, Michael Nielson

More information

UNDERSTANDING LOCAL MINIMA IN NEURAL NET-

UNDERSTANDING LOCAL MINIMA IN NEURAL NET- UNDERSTANDING LOCAL MINIMA IN NEURAL NET- WORKS BY LOSS SURFACE DECOMPOSITION Anonymous authors Paper under double-blind review ABSTRACT To provide principled ways of designing proper Deep Neural Network

More information

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann Neural Networks with Applications to Vision and Language Feedforward Networks Marco Kuhlmann Feedforward networks Linear separability x 2 x 2 0 1 0 1 0 0 x 1 1 0 x 1 linearly separable not linearly separable

More information

Expressiveness of Rectifier Networks

Expressiveness of Rectifier Networks Xingyuan Pan Vivek Srikumar The University of Utah, Salt Lake City, UT 84112, USA XPAN@CS.UTAH.EDU SVIVEK@CS.UTAH.EDU From the learning point of view, the choice of an activation function is driven by

More information

Learning Deep Architectures for AI. Part II - Vijay Chakilam

Learning Deep Architectures for AI. Part II - Vijay Chakilam Learning Deep Architectures for AI - Yoshua Bengio Part II - Vijay Chakilam Limitations of Perceptron x1 W, b 0,1 1,1 y x2 weight plane output =1 output =0 There is no value for W and b such that the model

More information

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence

Stochastic Gradient Estimate Variance in Contrastive Divergence and Persistent Contrastive Divergence ESANN 0 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 7-9 April 0, idoc.com publ., ISBN 97-7707-. Stochastic Gradient

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Expressiveness of Rectifier Networks

Expressiveness of Rectifier Networks Xingyuan Pan Vivek Srikumar The University of Utah, Salt Lake City, UT 84112, USA XPAN@CS.UTAH.EDU SVIVEK@CS.UTAH.EDU Abstract Rectified Linear Units (ReLUs have been shown to ameliorate the vanishing

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Introduction to Deep Neural Networks

Introduction to Deep Neural Networks Introduction to Deep Neural Networks Presenter: Chunyuan Li Pattern Classification and Recognition (ECE 681.01) Duke University April, 2016 Outline 1 Background and Preliminaries Why DNNs? Model: Logistic

More information

arxiv: v1 [cs.cl] 21 May 2017

arxiv: v1 [cs.cl] 21 May 2017 Spelling Correction as a Foreign Language Yingbo Zhou yingbzhou@ebay.com Utkarsh Porwal uporwal@ebay.com Roberto Konow rkonow@ebay.com arxiv:1705.07371v1 [cs.cl] 21 May 2017 Abstract In this paper, we

More information

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang Deep Feedforward Networks Han Shao, Hou Pong Chan, and Hongyi Zhang Deep Feedforward Networks Goal: approximate some function f e.g., a classifier, maps input to a class y = f (x) x y Defines a mapping

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Tips for Deep Learning

Tips for Deep Learning Tips for Deep Learning Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results

More information

Faster Training of Very Deep Networks Via p-norm Gates

Faster Training of Very Deep Networks Via p-norm Gates Faster Training of Very Deep Networks Via p-norm Gates Trang Pham, Truyen Tran, Dinh Phung, Svetha Venkatesh Center for Pattern Recognition and Data Analytics Deakin University, Geelong Australia Email:

More information

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services

More information

Deep Neural Networks (1) Hidden layers; Back-propagation

Deep Neural Networks (1) Hidden layers; Back-propagation Deep Neural Networs (1) Hidden layers; Bac-propagation Steve Renals Machine Learning Practical MLP Lecture 3 4 October 2017 / 9 October 2017 MLP Lecture 3 Deep Neural Networs (1) 1 Recap: Softmax single

More information

Deep Learning: Self-Taught Learning and Deep vs. Shallow Architectures. Lecture 04

Deep Learning: Self-Taught Learning and Deep vs. Shallow Architectures. Lecture 04 Deep Learning: Self-Taught Learning and Deep vs. Shallow Architectures Lecture 04 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Self-Taught Learning 1. Learn

More information

Tips for Deep Learning

Tips for Deep Learning Tips for Deep Learning Recipe of Deep Learning Step : define a set of function Step : goodness of function Step 3: pick the best function NO Overfitting! NO YES Good Results on Testing Data? YES Good Results

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

COMPLEX INPUT CONVOLUTIONAL NEURAL NETWORKS FOR WIDE ANGLE SAR ATR

COMPLEX INPUT CONVOLUTIONAL NEURAL NETWORKS FOR WIDE ANGLE SAR ATR COMPLEX INPUT CONVOLUTIONAL NEURAL NETWORKS FOR WIDE ANGLE SAR ATR Michael Wilmanski #*1, Chris Kreucher *2, & Alfred Hero #3 # University of Michigan 500 S State St, Ann Arbor, MI 48109 1 wilmansk@umich.edu,

More information

Jakub Hajic Artificial Intelligence Seminar I

Jakub Hajic Artificial Intelligence Seminar I Jakub Hajic Artificial Intelligence Seminar I. 11. 11. 2014 Outline Key concepts Deep Belief Networks Convolutional Neural Networks A couple of questions Convolution Perceptron Feedforward Neural Network

More information

RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS

RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS 2018 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 17 20, 2018, AALBORG, DENMARK RECURRENT NEURAL NETWORKS WITH FLEXIBLE GATES USING KERNEL ACTIVATION FUNCTIONS Simone Scardapane,

More information

Machine Learning Lecture 14

Machine Learning Lecture 14 Machine Learning Lecture 14 Tricks of the Trade 07.12.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory Probability

More information

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu

Convolutional Neural Networks II. Slides from Dr. Vlad Morariu Convolutional Neural Networks II Slides from Dr. Vlad Morariu 1 Optimization Example of optimization progress while training a neural network. (Loss over mini-batches goes down over time.) 2 Learning rate

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Introduction to Neural Networks

Introduction to Neural Networks CUONG TUAN NGUYEN SEIJI HOTTA MASAKI NAKAGAWA Tokyo University of Agriculture and Technology Copyright by Nguyen, Hotta and Nakagawa 1 Pattern classification Which category of an input? Example: Character

More information

arxiv: v3 [cs.ne] 21 Apr 2015

arxiv: v3 [cs.ne] 21 Apr 2015 LEARNING ACTIVATION FUNCTIONS TO IMPROVE DEEP NEURAL NETWORKS Forest Agostinelli Department of Computer Science University of California - Irvine Irvine, CA 92697, USA {fagostin}@uci.edu arxiv:1412.6830v3

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

Feature Design. Feature Design. Feature Design. & Deep Learning

Feature Design. Feature Design. Feature Design. & Deep Learning Artificial Intelligence and its applications Lecture 9 & Deep Learning Professor Daniel Yeung danyeung@ieee.org Dr. Patrick Chan patrickchan@ieee.org South China University of Technology, China Appropriately

More information

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections

More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

Introduction to Convolutional Neural Networks (CNNs)

Introduction to Convolutional Neural Networks (CNNs) Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei

More information

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions 2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 18) Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions Authors: S. Scardapane, S. Van Vaerenbergh,

More information

Ch.6 Deep Feedforward Networks (2/3)

Ch.6 Deep Feedforward Networks (2/3) Ch.6 Deep Feedforward Networks (2/3) 16. 10. 17. (Mon.) System Software Lab., Dept. of Mechanical & Information Eng. Woonggy Kim 1 Contents 6.3. Hidden Units 6.3.1. Rectified Linear Units and Their Generalizations

More information

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35 Neural Networks David Rosenberg New York University July 26, 2017 David Rosenberg (New York University) DS-GA 1003 July 26, 2017 1 / 35 Neural Networks Overview Objectives What are neural networks? How

More information

Learning Deep Architectures for AI. Part I - Vijay Chakilam

Learning Deep Architectures for AI. Part I - Vijay Chakilam Learning Deep Architectures for AI - Yoshua Bengio Part I - Vijay Chakilam Chapter 0: Preliminaries Neural Network Models The basic idea behind the neural network approach is to model the response as a

More information

Denoising Autoencoders

Denoising Autoencoders Denoising Autoencoders Oliver Worm, Daniel Leinfelder 20.11.2013 Oliver Worm, Daniel Leinfelder Denoising Autoencoders 20.11.2013 1 / 11 Introduction Poor initialisation can lead to local minima 1986 -

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information Mathias Berglund, Tapani Raiko, and KyungHyun Cho Department of Information and Computer Science Aalto University

More information

Deep Neural Networks (1) Hidden layers; Back-propagation

Deep Neural Networks (1) Hidden layers; Back-propagation Deep Neural Networs (1) Hidden layers; Bac-propagation Steve Renals Machine Learning Practical MLP Lecture 3 2 October 2018 http://www.inf.ed.ac.u/teaching/courses/mlp/ MLP Lecture 3 / 2 October 2018 Deep

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

Fast Learning with Noise in Deep Neural Nets

Fast Learning with Noise in Deep Neural Nets Fast Learning with Noise in Deep Neural Nets Zhiyun Lu U. of Southern California Los Angeles, CA 90089 zhiyunlu@usc.edu Zi Wang Massachusetts Institute of Technology Cambridge, MA 02139 ziwang.thu@gmail.com

More information

arxiv: v2 [cs.ne] 22 Feb 2013

arxiv: v2 [cs.ne] 22 Feb 2013 Sparse Penalty in Deep Belief Networks: Using the Mixed Norm Constraint arxiv:1301.3533v2 [cs.ne] 22 Feb 2013 Xanadu C. Halkias DYNI, LSIS, Universitè du Sud, Avenue de l Université - BP20132, 83957 LA

More information

More Tips for Training Neural Network. Hung-yi Lee

More Tips for Training Neural Network. Hung-yi Lee More Tips for Training Neural Network Hung-yi ee Outline Activation Function Cost Function Data Preprocessing Training Generalization Review: Training Neural Network Neural network: f ; θ : input (vector)

More information

arxiv: v1 [cs.lg] 10 Aug 2018

arxiv: v1 [cs.lg] 10 Aug 2018 Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning arxiv:1808.03578v1 [cs.lg] 10 Aug 2018 Noah Frazier-Logue Rutgers University Brain Imaging Center Rutgers

More information

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical) Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of

More information

arxiv: v1 [cs.lg] 20 Apr 2017

arxiv: v1 [cs.lg] 20 Apr 2017 Softmax GAN Min Lin Qihoo 360 Technology co. ltd Beijing, China, 0087 mavenlin@gmail.com arxiv:704.069v [cs.lg] 0 Apr 07 Abstract Softmax GAN is a novel variant of Generative Adversarial Network (GAN).

More information

What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1

What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1 Multi-layer networks Steve Renals Machine Learning Practical MLP Lecture 3 7 October 2015 MLP Lecture 3 Multi-layer networks 2 What Do Single

More information

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung

Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Sajid Anwar, Kyuyeon Hwang and Wonyong Sung Department of Electrical and Computer Engineering Seoul National University Seoul, 08826 Korea Email: sajid@dsp.snu.ac.kr, khwang@dsp.snu.ac.kr, wysung@snu.ac.kr

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

OPTIMIZATION METHODS IN DEEP LEARNING

OPTIMIZATION METHODS IN DEEP LEARNING Tutorial outline OPTIMIZATION METHODS IN DEEP LEARNING Based on Deep Learning, chapter 8 by Ian Goodfellow, Yoshua Bengio and Aaron Courville Presented By Nadav Bhonker Optimization vs Learning Surrogate

More information

arxiv: v2 [cs.ne] 7 Apr 2015

arxiv: v2 [cs.ne] 7 Apr 2015 A Simple Way to Initialize Recurrent Networks of Rectified Linear Units arxiv:154.941v2 [cs.ne] 7 Apr 215 Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton Google Abstract Learning long term dependencies

More information

arxiv: v1 [cs.lg] 9 Nov 2015

arxiv: v1 [cs.lg] 9 Nov 2015 HOW FAR CAN WE GO WITHOUT CONVOLUTION: IM- PROVING FULLY-CONNECTED NETWORKS Zhouhan Lin & Roland Memisevic Université de Montréal Canada {zhouhan.lin, roland.memisevic}@umontreal.ca arxiv:1511.02580v1

More information

Lecture 2: Learning with neural networks

Lecture 2: Learning with neural networks Lecture 2: Learning with neural networks Deep Learning @ UvA LEARNING WITH NEURAL NETWORKS - PAGE 1 Lecture Overview o Machine Learning Paradigm for Neural Networks o The Backpropagation algorithm for

More information

arxiv: v4 [cs.lg] 28 Jan 2014

arxiv: v4 [cs.lg] 28 Jan 2014 Low-Rank Approximations for Conditional Feedforward Computation in Deep Neural Networks arxiv:1312.4461v4 [cs.lg] 28 Jan 2014 Andrew S. Davis Department of Electrical Engineering and Computer Science University

More information

Deep Learning & Artificial Intelligence WS 2018/2019

Deep Learning & Artificial Intelligence WS 2018/2019 Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target

More information

Deep Neural Networks

Deep Neural Networks Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45 Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi

More information

ECE521 Lectures 9 Fully Connected Neural Networks

ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE-

COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- Workshop track - ICLR COMPARING FIXED AND ADAPTIVE COMPUTATION TIME FOR RE- CURRENT NEURAL NETWORKS Daniel Fojo, Víctor Campos, Xavier Giró-i-Nieto Universitat Politècnica de Catalunya, Barcelona Supercomputing

More information

ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS

ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS ANALYSIS ON GRADIENT PROPAGATION IN BATCH NORMALIZED RESIDUAL NETWORKS Anonymous authors Paper under double-blind review ABSTRACT We conduct mathematical analysis on the effect of batch normalization (BN)

More information

Understanding Neural Networks : Part I

Understanding Neural Networks : Part I TensorFlow Workshop 2018 Understanding Neural Networks Part I : Artificial Neurons and Network Optimization Nick Winovich Department of Mathematics Purdue University July 2018 Outline 1 Neural Networks

More information

ECE521 Lecture 7/8. Logistic Regression

ECE521 Lecture 7/8. Logistic Regression ECE521 Lecture 7/8 Logistic Regression Outline Logistic regression (Continue) A single neuron Learning neural networks Multi-class classification 2 Logistic regression The output of a logistic regression

More information

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 12 Feb 23, 2018 1 Neural Networks Outline

More information

Feedforward Neural Networks. Michael Collins, Columbia University

Feedforward Neural Networks. Michael Collins, Columbia University Feedforward Neural Networks Michael Collins, Columbia University Recap: Log-linear Models A log-linear model takes the following form: p(y x; v) = exp (v f(x, y)) y Y exp (v f(x, y )) f(x, y) is the representation

More information

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates

Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates Hiroaki Hayashi 1,* Jayanth Koushik 1,* Graham Neubig 1 arxiv:1611.01505v3 [cs.lg] 11 Jun 2018 Abstract Adaptive

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY, WITH IMPLICATIONS FOR TRAINING Sanjeev Arora, Yingyu Liang & Tengyu Ma Department of Computer Science Princeton University Princeton, NJ 08540, USA {arora,yingyul,tengyu}@cs.princeton.edu

More information

arxiv: v7 [cs.ne] 2 Sep 2014

arxiv: v7 [cs.ne] 2 Sep 2014 Learned-Norm Pooling for Deep Feedforward and Recurrent Neural Networks Caglar Gulcehre, Kyunghyun Cho, Razvan Pascanu, and Yoshua Bengio Département d Informatique et de Recherche Opérationelle Université

More information

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

More information

Convolutional Neural Networks. Srikumar Ramalingam

Convolutional Neural Networks. Srikumar Ramalingam Convolutional Neural Networks Srikumar Ramalingam Reference Many of the slides are prepared using the following resources: neuralnetworksanddeeplearning.com (mainly Chapter 6) http://cs231n.github.io/convolutional-networks/

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

Deep Learning for NLP

Deep Learning for NLP Deep Learning for NLP CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard Socher and Yoshua Bengio) Machine Learning and NLP NER WordNet Usually machine learning

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

11/3/15. Deep Learning for NLP. Deep Learning and its Architectures. What is Deep Learning? Advantages of Deep Learning (Part 1)

11/3/15. Deep Learning for NLP. Deep Learning and its Architectures. What is Deep Learning? Advantages of Deep Learning (Part 1) 11/3/15 Machine Learning and NLP Deep Learning for NLP Usually machine learning works well because of human-designed representations and input features CS224N WordNet SRL Parser Machine learning becomes

More information

An overview of deep learning methods for genomics

An overview of deep learning methods for genomics An overview of deep learning methods for genomics Matthew Ploenzke STAT115/215/BIO/BIST282 Harvard University April 19, 218 1 Snapshot 1. Brief introduction to convolutional neural networks What is deep

More information

Training Neural Networks Practical Issues

Training Neural Networks Practical Issues Training Neural Networks Practical Issues M. Soleymani Sharif University of Technology Fall 2017 Most slides have been adapted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017, and some from

More information

Random projection initialization for deep neural networks

Random projection initialization for deep neural networks Random projection initialization for deep neural networks Piotr Iwo Wo jcik and Marcin Kurdziel AGH University of Science and Technology, Faculty of Computer Science, Electronics and Telecommunications,

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN)

EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN) EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN) TARGETED PIECES OF KNOWLEDGE Linear regression Activation function Multi-Layers Perceptron (MLP) Stochastic Gradient Descent

More information

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas

Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas Very Deep Residual Networks with Maxout for Plant Identification in the Wild Milan Šulc, Dmytro Mishkin, Jiří Matas Center for Machine Perception Department of Cybernetics Faculty of Electrical Engineering

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Feedforward Neural Networks

Feedforward Neural Networks Feedforward Neural Networks Michael Collins 1 Introduction In the previous notes, we introduced an important class of models, log-linear models. In this note, we describe feedforward neural networks, which

More information

P-TELU : Parametric Tan Hyperbolic Linear Unit Activation for Deep Neural Networks

P-TELU : Parametric Tan Hyperbolic Linear Unit Activation for Deep Neural Networks P-TELU : Parametric Tan Hyperbolic Linear Unit Activation for Deep Neural Networks Rahul Duggal rahulduggal2608@gmail.com Anubha Gupta anubha@iiitd.ac.in SBILab (http://sbilab.iiitd.edu.in/) Deptt. of

More information

Compressing deep neural networks

Compressing deep neural networks From Data to Decisions - M.Sc. Data Science Compressing deep neural networks Challenges and theoretical foundations Presenter: Simone Scardapane University of Exeter, UK Table of contents Introduction

More information

Neural networks (NN) 1

Neural networks (NN) 1 Neural networks (NN) 1 Hedibert F. Lopes Insper Institute of Education and Research São Paulo, Brazil 1 Slides based on Chapter 11 of Hastie, Tibshirani and Friedman s book The Elements of Statistical

More information

IMPROVING STOCHASTIC GRADIENT DESCENT

IMPROVING STOCHASTIC GRADIENT DESCENT IMPROVING STOCHASTIC GRADIENT DESCENT WITH FEEDBACK Jayanth Koushik & Hiroaki Hayashi Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {jkoushik,hiroakih}@cs.cmu.edu

More information

Local Affine Approximators for Improving Knowledge Transfer

Local Affine Approximators for Improving Knowledge Transfer Local Affine Approximators for Improving Knowledge Transfer Suraj Srinivas & François Fleuret Idiap Research Institute and EPFL {suraj.srinivas, francois.fleuret}@idiap.ch Abstract The Jacobian of a neural

More information

Learning Deep Architectures

Learning Deep Architectures Learning Deep Architectures Yoshua Bengio, U. Montreal Microsoft Cambridge, U.K. July 7th, 2009, Montreal Thanks to: Aaron Courville, Pascal Vincent, Dumitru Erhan, Olivier Delalleau, Olivier Breuleux,

More information

CS60010: Deep Learning

CS60010: Deep Learning CS60010: Deep Learning Sudeshna Sarkar Spring 2018 16 Jan 2018 FFN Goal: Approximate some unknown ideal function f : X! Y Ideal classifier: y = f*(x) with x and category y Feedforward Network: Define parametric

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY

RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY RAGAV VENKATESAN VIJETHA GATUPALLI BAOXIN LI NEURAL DATASET GENERALITY SIFT HOG ALL ABOUT THE FEATURES DAISY GABOR AlexNet GoogleNet CONVOLUTIONAL NEURAL NETWORKS VGG-19 ResNet FEATURES COMES FROM DATA

More information