Convolutional Neural Networks. Srikumar Ramalingam


Reference: Many of the slides are prepared using the following resources: neuralnetworksanddeeplearning.com (mainly Chapter 6); http://cs231n.github.io/convolutional-networks/; Marc'Aurelio Ranzato's deep learning tutorial at CVPR 2014.

Introduction: Deep learning allows computational models that are composed of multiple layers to learn representations of data. It has significantly improved state-of-the-art results in speech recognition, visual object recognition, object detection, drug discovery, and genomics. The word "deep" comes from having multiple layers of non-linearity. [Source: Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, Deep Learning, Nature 2015]

Introduction: The word "neural" is used because these models are loosely inspired by neuroscience. The goal is generally to approximate some function f*, e.g., a classifier y = f*(x). We define a mapping y = f(x; θ) and learn the value of the parameters θ that result in the best function approximation. A feedforward network is a specific type of deep neural network in which information flows through the function being evaluated from the input x, through the intermediate computations used to define f, and finally to the output y.

Perceptron: A perceptron takes several Boolean inputs (x_1, x_2, x_3) and returns a Boolean output. The weights (w_1, w_2, w_3) and the threshold are real numbers.
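A minimal sketch of this rule (the weights and threshold below are arbitrary values chosen only for illustration):

```python
def perceptron(inputs, weights, threshold):
    """Return True when the weighted sum of the Boolean inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return total > threshold

# Three Boolean inputs with real-valued weights and threshold.
print(perceptron([True, False, True], weights=[0.6, 0.4, 0.3], threshold=0.5))   # True
print(perceptron([False, True, False], weights=[0.6, 0.4, 0.3], threshold=0.5))  # False
```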

The first learning machine: the Perceptron. Built at Cornell in 1960, it is an old paradigm. The Perceptron was a linear classifier on top of a simple feature extractor: y = sign(\sum_{i=1}^{N} W_i F_i(X) + b). The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching. Designing a feature extractor requires considerable effort by experts. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Motivation for CNNs: Consider an input with 28x28 = 784 values and 3 fully connected hidden layers. We can achieve an accuracy of about 98% with just fully connected layers on the MNIST digit recognition dataset.

Fully Connected Layer. Example: a 300x300 image with 40K hidden units needs ~4B parameters! Fully connected layers do not take into account the spatial structure of images; for instance, they treat input pixels that are far apart and pixels that are close together on exactly the same footing. Spatial correlation is local, so this is a waste of resources, and we do not have enough training samples anyway. Slide Credit: Marc'Aurelio Ranzato

Locally Connected Layer. Example: a 300x300 image with 40K hidden units and a filter size of 10x10 needs 4M parameters. Note: this parameterization is good when the input image is registered (e.g., face recognition). Slide Credit: Marc'Aurelio Ranzato

Convolutional Layer: Share the same parameters across different locations (assuming the input is stationary): convolutions with learned kernels. Slide Credit: Marc'Aurelio Ranzato
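To make the parameter counts on the previous few slides concrete, here is a small arithmetic sketch (the layer sizes are the ones quoted above; the variable names are just for illustration, and biases are ignored):

```python
# Rough parameter counts for a 300x300 grayscale image and 40K hidden units,
# matching the numbers quoted on the slides.

input_size = 300 * 300          # 90,000 input pixels
hidden_units = 40_000
filter_size = 10 * 10           # 10x10 local receptive field

fully_connected = input_size * hidden_units       # every unit sees every pixel
locally_connected = hidden_units * filter_size    # every unit sees its own 10x10 window
convolutional = filter_size                       # one 10x10 kernel shared everywhere

print(f"fully connected:   {fully_connected:,}")    # 3,600,000,000  (~4B)
print(f"locally connected: {locally_connected:,}")  # 4,000,000      (4M)
print(f"convolutional:     {convolutional:,}")      # 100 per feature map
```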

Local Receptive Fields in CNNs: With local receptive fields, each neuron in the hidden layer is connected to a small window of input neurons, say a 5x5 region, corresponding to 25 input neurons. Each hidden neuron can be thought of as analyzing its local receptive field.

Local Receptive Fields in CNNs We slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron. If we have a 28x28 image and 5x5 receptive field, we will have 24x24 hidden neurons.

Stride Length Sometimes we slide the local receptive field over by more than one pixel to the right (or down). In that case the stride length could be 2 or more. This will lead to fewer hidden neurons. For example, in the case of a 28x28 image with 5x5 receptive field and stride length 2, we will have just 12x12 hidden neurons.

Shared Weights and Biases: The output of the (j, k)-th hidden neuron is \sigma\left(b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m} \, a_{j+l,\,k+m}\right), where \sigma is the activation function (such as a sigmoid unit), b is the bias term, w_{l,m} is the weight term, and a_{x,y} is the input variable at location (x, y). Note that the weights and the bias term are the same for all the hidden neurons in one feature map. All the neurons in the first hidden layer detect exactly the same feature, just at different locations in the image, e.g., cats.
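As a rough illustration of this shared-weight computation, here is a minimal numpy sketch of a single feature map (valid convolution, stride 1; the function and variable names are made up for this example):

```python
import numpy as np

def feature_map(a, w, b, activation=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Compute one feature map: sigma(b + sum_{l,m} w[l,m] * a[j+l, k+m]).

    a : 2D input array (e.g., a 28x28 image)
    w : 2D kernel of shared weights (e.g., 5x5)
    b : shared bias (scalar)
    """
    H, W = a.shape
    k, _ = w.shape
    out = np.empty((H - k + 1, W - k + 1))
    for j in range(H - k + 1):
        for kk in range(W - k + 1):
            out[j, kk] = np.sum(w * a[j:j + k, kk:kk + k]) + b
    return activation(out)

image = np.random.rand(28, 28)
kernel = np.random.randn(5, 5)
print(feature_map(image, kernel, b=0.1).shape)   # (24, 24) hidden neurons
```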

Sigmoid neuron: A sigmoid neuron takes real numbers (x_1, x_2, x_3) and returns a number between 0 and 1. The weights (w_1, w_2, w_3) and the bias term b are real numbers. The output is the sigmoid function \sigma(z) = 1/(1 + e^{-z}) applied to the weighted input z = w \cdot x + b.

Rectified linear neuron: outputs \max(0, z); the preferred choice for many computer vision problems.
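For concreteness, here is a small sketch of the activation functions mentioned so far (a perceptron-style threshold, the sigmoid, and the rectified linear unit); the function names are illustrative:

```python
import numpy as np

def step(z):        # perceptron-style threshold
    return (z > 0).astype(float)

def sigmoid(z):     # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):        # rectified linear unit: max(0, z)
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(step(z), sigmoid(z), relu(z), sep="\n")
```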

Convolutional Layer (figure-only slides: the input image is convolved with a learned kernel, e.g., the 3x3 kernel [-1 0 1; -1 0 1; -1 0 1], to produce a feature map). Slide Credit: Marc'Aurelio Ranzato

Feature maps (figure: 3 feature maps). We call this map from the input layer to the hidden layer a feature map. The weights are called shared weights and the biases are called shared biases, and they are only shared within one feature map. Different feature maps have different weights and biases.

Pooling Layers A pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of neurons in the previous layer. As a concrete example, one common procedure for pooling is known as max-pooling, e.g., a pooling unit can output the maximum activation in the 2x2 input region.
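A minimal sketch of 2x2 max-pooling on a single feature map (assuming the height and width are even; the names are illustrative):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Condense a feature map by taking the max over non-overlapping 2x2 regions."""
    H, W = fmap.shape
    assert H % 2 == 0 and W % 2 == 0, "assumes even dimensions for simplicity"
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # 2x2 output, each entry the max of a 2x2 block
```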

Pooling Layers: L2 pooling takes the square root of the sum of the squares of the activations in each 2x2 region. Several other pooling options exist (e.g., average pooling).

Output size of a convolutional layer (figure): with an input of size D x D x M, N kernels of size K x K, no zero-padding (P = 0), and stride 1, the output is of size (D - K + 1) x (D - K + 1) x N.

With padding P and stride S: The input is D x D x M, i.e., there are M input channels (M @ D x D). Let us assume that we have N kernels of size K x K, i.e., N output channels. What are the dimensions of the output? N @ ( ((D - K + 2P)/S + 1) x ((D - K + 2P)/S + 1) ). Example in 1D (figure): a filter of size K = 3 sliding over a padded input with D = 5, P = 1, S = 1 and with D = 5, P = 1, S = 2.
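A small helper to check this formula on the 1D examples from the slide (the function name is made up for this sketch):

```python
def conv_output_size(D, K, P, S):
    """Spatial output size of a convolution: (D - K + 2P) / S + 1."""
    size = (D - K + 2 * P) / S + 1
    assert size == int(size), "these parameters do not tile the input evenly"
    return int(size)

# 1D examples from the slide: D=5, K=3, P=1
print(conv_output_size(5, 3, 1, 1))   # 5  (stride 1 preserves the size)
print(conv_output_size(5, 3, 1, 2))   # 3  (stride 2 roughly halves it)
# The no-padding, stride-1 case reduces to D - K + 1:
print(conv_output_size(28, 5, 0, 1))  # 24
```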

Stride S = 2, Padding P = 1, input size D=5, filter size K=3

Why use padding? If there is no padding, then the size of the output would reduce by a small amount after each CONV, and the information at the borders would be washed away too quickly.

Pooling output size (figure): an input of size D x D x M pooled with non-overlapping K x K windows gives an output of size (D/K) x (D/K) x M; the number of channels M is unchanged.

With overlap? For pooling with filter size K and stride S, the output dimension is M @ ( ((D - K)/S + 1) x ((D - K)/S + 1) ).

What should the size of the pools be? It depends on how robust or invariant we want the representation to be. It is best to pool slowly, i.e., apply each pooling layer only after a sequence of conv layers.

Getting rid of the pooling layer: To reduce the size of the representation, use a larger stride in a CONV layer once in a while. Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will use very few to no pooling layers. Reference: Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, Martin Riedmiller, Striving for Simplicity: The All Convolutional Net, 2014.

Conv -> FC and FC -> Conv: A convolutional layer can be seen as an FC layer with a sparse, shared-weight matrix, and an FC layer can be seen as a convolution whose kernel covers the entire input.
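As a rough sanity check of the FC -> Conv direction, the sketch below (with illustrative names) shows that a fully connected unit over a D x D input computes the same value as a "valid" convolution whose kernel is the full D x D weight map:

```python
import numpy as np

# FC -> Conv: a fully connected unit over a DxD input is the same computation
# as a valid convolution whose kernel is the full DxD weight matrix.
D = 7
x = np.random.rand(D, D)          # input
W = np.random.randn(D, D)         # FC weights reshaped to the input's shape
b = 0.3

fc_output = W.flatten() @ x.flatten() + b      # fully connected unit
conv_output = np.sum(W * x) + b                # DxD kernel, single valid position

print(np.isclose(fc_output, conv_output))      # True
```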

MNIST data Each grayscale image is of size 28x28. 60,000 training images and 10,000 test images 10 possible labels (0,1,2,3,4,5,6,7,8,9) One of the very early datasets for neural networks, but still actively used by researchers for testing their algorithms.

Performance on MNIST is near perfect: 33 out of 10,000 test images are misclassified (figure; top right: the correct label, bottom right: the misclassified prediction). This uses several ideas: convolutions, pooling, the use of GPUs to do far more training than we did with shallow networks, the algorithmic expansion of our training data, dropout, etc.

Digit recognition using 3 layers. Example output (one-hot encoding): 6 -> [0 0 0 0 0 0 1 0 0 0]. The input is normalized to values between 0 and 1.

Matrix equations for neural networks The indices j and k seem a little counter-intuitive!

Layer-to-layer relationship: a_j^l = \sigma(z_j^l), where z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l. Here b_j^l is the bias term of the jth neuron in the lth layer, a_j^l is the activation of the jth neuron in the lth layer, and z_j^l is the weighted input to the jth neuron in the lth layer.
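In matrix form this reads a^l = \sigma(W^l a^{l-1} + b^l); below is a minimal forward-pass sketch under that convention (the layer sizes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, weights, biases):
    """Feed activations through each layer: a^l = sigmoid(W^l a^{l-1} + b^l)."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Example: a 784 -> 30 -> 10 network with random parameters.
sizes = [784, 30, 10]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.random(784)                         # a flattened 28x28 input
print(forward(x, weights, biases).shape)    # (10,) output activations
```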

(Whiteboard example: counting, layer by layer, the number of weights and bias terms in a small multi-layer network.)

Cost function from the network: C(w, b) = \frac{1}{2n} \sum_x \| y(x) - a \|^2, where y(x) is the ground truth for each input x, a is the output activation vector for the specific training sample x, and n is the number of input samples; the sum runs over each input sample.

Cost function: w and b are the parameters to compute, n is the number of input samples, x is the input vector, and a is the output. We assume that the network approximates a function y(x) and outputs a. We use a quadratic cost function, i.e., mean squared error or MSE.
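A minimal sketch of this quadratic (MSE) cost over a handful of training samples (the values are made up for illustration):

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """C = (1 / 2n) * sum_x || y(x) - a ||^2 over n training samples."""
    n = len(outputs)
    return sum(np.sum((y - a) ** 2) for a, y in zip(outputs, targets)) / (2 * n)

a1, y1 = np.array([0.8, 0.1]), np.array([1.0, 0.0])
a2, y2 = np.array([0.3, 0.6]), np.array([0.0, 1.0])
print(quadratic_cost([a1, a2], [y1, y2]))   # small non-negative number
```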

Cost function Can the cost function be negative in the above example? What does it mean when the cost is approximately equal to zero?

Gradient Descent: Let us consider a cost function C(v_1, v_2) that depends on two variables; the goal is to change the two variables to minimize the cost function. Small changes in the parameters lead to small changes in the output, \Delta C \approx \nabla C \cdot \Delta v, where \nabla C is the gradient vector. We change the parameters using the (positive) learning rate \eta and the gradient vector, giving the update rule v \rightarrow v' = v - \eta \nabla C.
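A small sketch of this update rule on a toy two-variable cost C(v_1, v_2) = v_1^2 + v_2^2 (the cost and the names are made up for illustration):

```python
import numpy as np

def grad_C(v):
    """Gradient of the toy cost C(v1, v2) = v1^2 + v2^2."""
    return 2 * v

v = np.array([2.0, -3.0])   # initial parameters
eta = 0.1                   # learning rate (positive)

for _ in range(100):
    v = v - eta * grad_C(v)     # update rule: v -> v - eta * grad C(v)

print(v)    # close to the minimizer (0, 0)
```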

Cost function from the network: the parameters to compute are w and b, and the cost (and hence the gradient) is an average over all n input samples. What are the challenges in gradient descent when you have a large number of training samples? This motivates estimating the gradient from a small set of training samples.

Consider a simple case: a single neuron with a = \sigma(wx + b) and the quadratic cost C = (y - a)^2 / 2. The gradient with respect to the weight contains the factor \sigma'(z) (see the graph of \sigma), which is very small when z > 4 or z < -4. This leads to learning slowdown with the MSE (mean squared error) cost.

Cross-entropy loss function: C = -\frac{1}{n} \sum_x [y \ln a + (1 - y) \ln(1 - a)], where n is the total number of items of training data, x is the input, y is the required output, and a is the output from the neuron. Here a can be interpreted as the probability of the output being class 1, and (1 - a) as the probability of it being class 0.

Entropy measures "disorderliness": E = -\sum_i P_i \log_2 P_i. For a fair coin, P(H) = P(T) = 0.5, so E = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1, i.e., 1 bit is needed to transmit Head or Tail. If P(H) = 1 and P(T) = 0, then E = -1 \log_2 1 - 0 = 0: if it is always Head, there is no need to transmit any information.

Cross-entropy loss function: C = -\frac{1}{n} \sum_x [y \ln a + (1 - y) \ln(1 - a)], where n is the total number of items of training data, x is the input, y is the required output, and a is the output from the neuron. The cost function is non-negative: \ln a is negative whenever 0 < a < 1, so the bracketed term is negative and the leading minus sign makes C positive. If the neuron's actual output is close to the desired output, then the cost function is close to 0.
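A minimal sketch of this binary cross-entropy cost (the clipping is added only to avoid log(0); the names are illustrative):

```python
import numpy as np

def cross_entropy_cost(outputs, targets, eps=1e-12):
    """C = -(1/n) * sum_x [ y ln a + (1 - y) ln(1 - a) ]."""
    a = np.clip(np.asarray(outputs, dtype=float), eps, 1 - eps)
    y = np.asarray(targets, dtype=float)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

print(cross_entropy_cost([0.9, 0.2, 0.8], [1, 0, 1]))   # small: outputs near targets
print(cross_entropy_cost([0.1, 0.9, 0.2], [1, 0, 1]))   # large: outputs far from targets
```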

Consider the simple case of a single neuron, a = \sigma(z) with z = wx + b, and the cross-entropy cost C = -[y \ln a + (1 - y) \ln(1 - a)]. Then \partial C / \partial w = \frac{a - y}{a(1 - a)} \sigma'(z) x, and since \sigma'(z) = \sigma(z)(1 - \sigma(z)) = a(1 - a), this simplifies to \partial C / \partial w = x(a - y). The \sigma'(z) term that causes learning slowdown cancels out.

Derivative of the sigmoid function: \sigma(z) = \frac{1}{1 + e^{-z}}, so \sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)(1 - \sigma(z)).

Cross-entropy loss for multiple neurons: C = -\frac{1}{n} \sum_x \sum_j [y_j \ln a_j^L + (1 - y_j) \ln(1 - a_j^L)]. The desired values of the output neurons are given by y_1, y_2, ..., and the actual output values are given by a_1^L, a_2^L, ...

SoftMax layer: a new kind of output layer, a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}. The output activations are guaranteed to sum to 1 and can be interpreted as probabilities.
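A small sketch of the softmax computation (the max-subtraction is a standard trick for numerical stability and does not change the result; names are illustrative):

```python
import numpy as np

def softmax(z):
    """a_j = exp(z_j) / sum_k exp(z_k); subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a, a.sum())   # activations sum to 1 and can be read as probabilities
```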

Stochastic gradient descent The idea is to compute the gradient using a small set of randomly chosen training data. We assume that the average gradient obtained from the small set is close to the gradient obtained from the entire set.

Stochastic gradient descent: Let us consider a mini-batch with m randomly chosen samples. Provided that the sample size is large enough, we expect the average gradient from the m samples to be approximately equal to the average gradient from all n samples. Backpropagation is a method to compute the gradients!
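A minimal sketch of the mini-batch SGD loop, using a toy per-sample cost C_x(v) = \|v - x\|^2 so that the gradient is easy to write down (everything here is illustrative, not the lecture's actual training setup):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=(1000, 2))   # n training samples

def grad_per_sample(v, x):
    """Gradient of the toy per-sample cost C_x(v) = ||v - x||^2."""
    return 2 * (v - x)

v = np.zeros(2)     # parameters to learn
eta, m = 0.05, 32   # learning rate and mini-batch size

for epoch in range(20):
    rng.shuffle(data)
    for i in range(0, len(data), m):
        batch = data[i:i + m]
        # The average gradient over the mini-batch approximates the full gradient.
        g = np.mean([grad_per_sample(v, x) for x in batch], axis=0)
        v = v - eta * g

print(v)    # close to the data mean (about [3, 3]), the minimizer of the average cost
```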

Thank You