Neural Networks. Learning and Computer Vision Prof. Olga Veksler CS9840. Lecture 10

CS9840 Learning and Computer Vision Prof. Olga Veksler Lecture 10 Neural Networks Many slides are from Andrew Ng, Yann LeCun, Geoffrey Hinton, Abin - Roozgard

Outline
  Short Intro
  Perceptron (1 layer NN)
  Multilayer Perceptron (MLP)
  Deep Networks
  Convolutional Network

Neural Networks
  Neural networks correspond to some discriminant function $g_{NN}(x)$.
  They can carve out arbitrarily complex decision boundaries without requiring as many terms as polynomial functions.
  Neural nets were inspired by research on how the human brain works, but they also proved to be quite successful in practice.
  They are now used successfully for a wide variety of applications.
    It took some time to get them to work.
    They are now used by the US postal service for postal code recognition.

Neuron: Basic Brain Processor
  Neurons (or nerve cells) are special cells that process and transmit information by electrical signaling in the brain and also the spinal cord.
  The human brain has around $10^{11}$ neurons.
  A neuron connects to other neurons to form a network.
  Each neuron communicates with anywhere from 1,000 to 10,000 other neurons.

Neuron: Main Components
  [Figure: neuron with labeled dendrites, cell body, nucleus, axon, and axon terminals]
  Cell body: the computational unit.
  Dendrites: input wires that receive inputs from other neurons.
    A neuron may have thousands of dendrites, usually short.
  Axon: the output wire that sends signals to other neurons.
    A single long structure (up to 1 meter).
    Splits into possibly thousands of branches at the end, the axon terminals.

ANN History: First Successes
  1958, F. Rosenblatt, Cornell University.
  Perceptron: the oldest neural network still in use today.
    That's what we studied in the lecture on linear classifiers.
  Algorithm to train the perceptron network.
  Built in hardware.
  Proved convergence in the linearly separable case.
  Initially seemed promising, but too many claims were made.

ANN History: Stagnation
  Early success led to a lot of claims which were not fulfilled.
  1969, M. Minsky and S. Papert, book "Perceptrons":
    Proved that perceptrons can learn only linearly separable classes.
    In particular, they cannot learn the very simple XOR function.
    Conjectured that multilayer neural networks are also limited to linearly separable functions.
  No funding and almost no research (at least in North America) in the 1970s as a result of the two things above.

ANN History: Revival
  Revival of ANN in the 1980s.
  1982, J. Hopfield:
    New kind of networks (Hopfield's networks).
    Not just a model of how the human brain might work, but also a way to create useful devices.
    Implements associative memory.
  1982, joint US-Japanese conference on ANN:
    The US worries that it will fall behind.
  Many examples of multilayer neural networks appear.
  1986, re-discovery of the backpropagation algorithm by Werbos, Rumelhart, Hinton and Ronald Williams:
    Allows a network to learn classes that are not linearly separable.
  Lots of successes in the past 5 years due to better training.

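Artificial Neural Nets (ANN): Perceptron
  [Figure: layer 1 (input layer) with a bias unit and inputs $x_1, x_2, x_3$, connected by edge weights $w_0, w_1, w_2, w_3$ to layer 2 (output layer), a single unit with $h() = \mathrm{sign}()$ that outputs $\mathrm{sign}(w^t x + w_0)$]
  The linear classifier $f(x) = \mathrm{sign}(w^t x + w_0)$ is a single neuron net.
  Input layer units output the features, except the bias unit, which outputs 1.
  The output layer unit applies $\mathrm{sign}()$ or some other function $h()$.
  $h()$ is also called an activation function.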
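As a minimal sketch of the single-neuron classifier $f(x) = \mathrm{sign}(w^t x + w_0)$, the snippet below evaluates a perceptron in plain Python. The weight values and the convention sign(0) = +1 are assumptions chosen only for illustration.

```python
def sign(a):
    """Sign activation; sign(0) = +1 is an assumed convention."""
    return 1 if a >= 0 else -1

def perceptron(x, w, w0):
    """Single-neuron network: f(x) = sign(w^t x + w0)."""
    a = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return sign(a)

# Hypothetical weights for a 3-feature input (illustration only).
w, w0 = [1.0, -2.0, 0.5], 0.25
print(perceptron([1.0, 0.0, 1.0], w, w0))   # weighted sum is 1.75
print(perceptron([0.0, 1.0, 0.0], w, w0))   # weighted sum is -1.75
```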

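Multilayer Neural Network (MLP)
  [Figure: layer 1 (input layer) with inputs $x_1, x_2, x_3$; layer 2 (hidden layer) with two units; layer 3 (output layer) with one unit; edges labeled with weights $w$]
  First hidden unit outputs: $h(\ldots) = h(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3)$
  Second hidden unit outputs: $h(\ldots) = h(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3)$, computed with that unit's own edge weights.
  The network corresponds to the classifier $f(x) = h(w \cdot h(\ldots) + w \cdot h(\ldots))$.
  More complex than the Perceptron, with more complex decision boundaries.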

MLP Small Example
  [Figure: layer 1 (input) with $x_1, x_2$; layer 2 (hidden) with two units, input weights 3, 5 and 6, 3; layer 3 (output) with weights 7, 4, 2]
  Let the activation function be $h() = \mathrm{sign}()$.
  The MLP corresponds to the classifier
  $f(x) = \mathrm{sign}(4 h(\ldots) + 2 h(\ldots) + 7) = \mathrm{sign}(4\,\mathrm{sign}(3 x_1 + 5 x_2) + 2\,\mathrm{sign}(6 + 3 x_2) + 7)$
  MLP terminology: computing $f(x)$ is called the feed forward operation.
    Graphically, the function is computed from left to right.
  Edge weights are learned through training.
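The feed forward computation of this small example can be sketched directly from the formula, using the weights as they appear in this transcription. The function name `feed_forward` and the convention sign(0) = +1 are assumptions made for illustration.

```python
def sign(a):
    # sign(0) = +1 is an assumed convention; the slide does not specify it
    return 1 if a >= 0 else -1

def feed_forward(x1, x2):
    """Return (h1, h2, f) for the small example network."""
    h1 = sign(3 * x1 + 5 * x2)        # first hidden unit
    h2 = sign(6 + 3 * x2)             # second hidden unit, bias weight 6
    f = sign(4 * h1 + 2 * h2 + 7)     # output unit, bias weight 7
    return h1, h2, f

print(feed_forward(1, 1))     # both hidden units output +1
print(feed_forward(-1, -3))   # both hidden units output -1
```

With the weights as transcribed, the output bias 7 is larger than 4 + 2, so only the hidden-unit values change with the input; the example is mainly useful for tracing the left-to-right order of computation.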

MLP: Multiple Classes
  [Figure: layer 1 (input layer) with inputs $x_1, x_2$; layer 2 (hidden layer); layer 3 (output layer)]
  3 classes, 2 features, 1 hidden layer.
  2 input units, one for each feature.
  3 output units, one for each class.
  2 hidden units.
  1 bias unit, usually drawn in layer 1.

MLP: General Structure
  [Figure: layer 1 (input layer) with $x_1, x_2$; layer 2 (hidden layer); layer 3 (output layer) with unit outputs $h(\ldots) = f_1(x)$, $h(\ldots) = f_2(x)$, $h(\ldots) = f_3(x)$]
  $f(x) = [f_1(x), f_2(x), f_3(x)]$ is multi-dimensional.
  Classification:
    If $f_1(x)$ is largest, decide class 1.
    If $f_2(x)$ is largest, decide class 2.
    If $f_3(x)$ is largest, decide class 3.

MLP: General Structure
  [Figure: layer 1 (input layer) with $x_1, x_2$; layer 2 (hidden layer); layer 3 (output layer)]
  Input layer: $d$ features, $d$ input units.
  Output layer: $m$ classes, $m$ output units.
  Hidden layer: how many units?

MLP: General Structure
  [Figure: layer 1 (input layer) with $x_1, x_2$; layer 2 (hidden layer); layer 3 (hidden layer); layer 4 (output layer); edges labeled with weights $w$]
  Can have more than 1 hidden layer.
    The $i$th layer connects to the $(i+1)$th layer, except that the bias unit can connect to any layer.
    Can have a different number of units in each hidden layer.
  First output unit outputs (each $w$ denotes a different edge weight):
  $h(\ldots) = h(w \cdot h(\ldots) + w) = h(w \cdot h(w \cdot h(\ldots) + w \cdot h(\ldots)) + w)$

MLP: Activation Function
  $h() = \mathrm{sign}()$ is discontinuous, which is not good for gradient descent.
  Instead, can use a continuous sigmoid function.
  Rectified Linear (ReLU) has been gaining popularity recently.
  Or another differentiable function.
  Can even use different activation functions at different layers/units.
  From now on, assume $h()$ is a differentiable function.
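For concreteness, here is a small sketch of two common choices of $h()$ and their derivatives $h'()$, which gradient descent needs. The logistic form of the sigmoid and the value assigned to the ReLU derivative at 0 are assumptions; the slides do not fix either.

```python
import math

def sigmoid(a):
    """Logistic sigmoid (assumed form of the 'continuous sigmoid')."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def relu(a):
    """Rectified linear unit: max(0, a)."""
    return max(0.0, a)

def relu_prime(a):
    # ReLU is not differentiable at 0; using 0 there is an assumed convention
    return 1.0 if a > 0 else 0.0

print(sigmoid(0.0), sigmoid_prime(0.0))
print(relu(-2.0), relu(3.0))
```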

MLP: Overview
  A neural network corresponds to a classifier $f(x, w)$ that can be rather complex.
    Complexity depends on the number of hidden layers/units.
    $f(x, w)$ is a composition of many functions: easier to visualize as a network, but the notation gets ugly.
  To train a neural network, just as before:
    formulate an objective function $J(w)$;
    optimize it with gradient descent.
  That's all! Except we need quite a few slides to write down the details, due to the complexity of $f(x, w)$.

Expressive Power of MLP
  Every continuous function from input to output can be implemented with enough hidden units, 1 hidden layer, and proper nonlinear activation functions.
    It is easy to show that with a linear activation function, a multilayer neural network is equivalent to a perceptron.
  This is of more theoretical than practical interest.
    The proof is not constructive (it does not tell how to construct the MLP).
    Even if it were constructive, it would be of no use: we do not know the desired function, and our goal is to learn it from the samples.
    But this result gives confidence that we are on the right track: the MLP is general (expressive) enough to construct any required decision boundaries, unlike the Perceptron.
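The remark about linear activations can be checked numerically. The sketch below, with hypothetical weights, composes two linear layers and compares the result with a single linear layer whose weights are $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.

```python
def linear_layer(W, b, x):
    """Apply one layer with a linear (identity) activation: W x + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
W1, b1 = [[1.0, 2.0], [-1.0, 0.5]], [0.5, -1.0]
W2, b2 = [[2.0, -3.0]], [1.0]

def two_layer(x):
    return linear_layer(W2, b2, linear_layer(W1, b1, x))

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
     for i in range(1)]
b = [sum(W2[i][k] * b1[k] for k in range(2)) + b2[i] for i in range(1)]

x = [1.0, 2.0]
print(two_layer(x), linear_layer(W, b, x))   # the two outputs agree
```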

Decision Boundaries
  Perceptron (single layer neural net): linear decision boundary.
  Multilayer network: arbitrarily complex decision regions, even non-contiguous ones.

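Nonlinear Decision Boundary: Example
  Start with two Perceptrons, $h() = \mathrm{sign}()$.
  [Figure: two perceptron networks over inputs $x_1, x_2$, each shown with its linear inequality and the half-plane it assigns to class 1; the edge weights and inequalities are not recoverable from the extracted text]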

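Nonlinear Decision Boundary: Example
  Now combine them into a 3 layer NN.
  [Figure: the two perceptrons become the hidden units of a 3 layer network over inputs $x_1, x_2$, shown with the resulting nonlinear decision region; the edge weights are not recoverable from the extracted text]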

MLP: Modes of Operation
  For neural networks, due to historical reasons, the training and testing stages have special names.
  Backpropagation (or training): minimize the objective function with gradient descent.
  Feedforward (or testing).

MLP: Notation for Edge Weights
  [Figure: layer 1 (input: bias unit or unit 0, and units 1 to $d$ with inputs $x_1, \ldots, x_d$), layer 2 (hidden), ..., layer $k-1$ (hidden), layer $k$ (output), with example edge weights $w^k_{1m}$ and $w^k_{0m}$]
  $w^k_{pj}$ is the edge weight from unit $p$ in layer $k-1$ to unit $j$ in layer $k$.
  $w^k_{0j}$ is the edge weight from the bias unit to unit $j$ in layer $k$.
  $w^k_j$ is the vector of all weights to unit $j$ in layer $k$, i.e. $w^k_{0j}, w^k_{1j}, \ldots, w^k_{N(k-1)j}$.
  $N(k)$ is the number of units in layer $k$, excluding the bias unit.

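MLP: More Notation
  [Figure: three-layer network with inputs $x_1, x_2$ and example unit outputs labeled]
  Denote the output of unit $j$ in layer $k$ as $z^k_j$.
  For the input layer ($k = 1$), $z^1_0 = 1$ and $z^1_j = x_j$ for $j \neq 0$.
  For all other layers ($k > 1$), $z^k_j = h(\ldots)$, the activation function applied to the weighted sum of the unit's inputs.
  Convenient to set $z^k_0 = 1$ for all $k$.
  Set $z^k = [z^k_0, z^k_1, \ldots, z^k_{N(k)}]$.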

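MLP: More Notation
  [Figure: three-layer network with inputs $x_1, x_2$]
  The net activation at unit $j$ in layer $k > 1$ is the weighted sum of its inputs:
  $$a^k_j = \sum_{p=1}^{N(k-1)} z^{k-1}_p w^k_{pj} + w^k_{0j} = \sum_{p=0}^{N(k-1)} z^{k-1}_p w^k_{pj} = z^{k-1} \cdot w^k_j$$
  For example, $a^2_1 = z^1_0 w^2_{01} + z^1_1 w^2_{11} + z^1_2 w^2_{21}$.
  For $k > 1$, $z^k_j = h(a^k_j)$.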
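The notation translates almost line by line into a feed forward routine. This is a sketch under assumed conventions: `weights[k][j]` is a Python list holding $[w_{0j}, w_{1j}, \ldots]$ for unit $j$ of one non-input layer, and the example weights are hypothetical.

```python
import math

def feed_forward(x, weights, h):
    """Compute the network output for input x with activation h."""
    z = [1.0] + list(x)                  # input layer: z_0 = 1, z_j = x_j
    for layer in weights:
        # net activation a_j = sum over p of z_p * w_pj (bias included as p = 0)
        a = [sum(zp * wp for zp, wp in zip(z, w_j)) for w_j in layer]
        z = [1.0] + [h(aj) for aj in a]  # unit outputs, with bias unit prepended
    return z[1:]                         # drop the bias entry of the last layer

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical 2-2-1 network.
weights = [
    [[0.0, 1.0, -1.0], [0.5, 0.5, 0.5]],   # layer 2: two hidden units
    [[0.0, 2.0, -2.0]],                    # layer 3: one output unit
]
out = feed_forward([1.0, 1.0], weights, sigmoid)
print(out)
```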

MLP: Class Representation
  $m$ class problem; let the neural net have $t$ layers.
  Let $x^i$ be an example of class $c$.
  It is convenient to denote its label as $y^i = [0, \ldots, 1, \ldots, 0]^T$, with the 1 in row $c$.
  Recall that $z^t_c$ is the output of unit $c$ in layer $t$ (the output layer):
  $f(x) = z^t = [z^t_1, \ldots, z^t_c, \ldots, z^t_m]^T$
  If $x^i$ is of class $c$, want $z^t = [0, \ldots, 1, \ldots, 0]^T$, with the 1 in row $c$.
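A short sketch of this label encoding, together with the "largest output wins" decision rule from the general structure slides. The helper names `one_hot` and `decide_class` are hypothetical, and classes are numbered from 1 as in the slides.

```python
def one_hot(c, m):
    """Label vector for class c (1-based) in an m-class problem."""
    return [1 if r == c else 0 for r in range(1, m + 1)]

def decide_class(f):
    """Return the 1-based index of the largest network output."""
    return max(range(len(f)), key=lambda r: f[r]) + 1

print(one_hot(2, 3))
print(decide_class([0.1, 0.7, 0.2]))
```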

Training MLP: Objective Function
  Want to minimize the difference between $y^i$ and $f(x^i)$; use the squared difference.
  Let $w$ be all edge weights in the MLP collected in one vector.
  Error on one example $x^i$: $J_i(w) = \frac{1}{2} \sum_{c=1}^{m} \left( f_c(x^i) - y^i_c \right)^2$
  Error on all examples: $J(w) = \frac{1}{2} \sum_{i=1}^{n} \sum_{c=1}^{m} \left( f_c(x^i) - y^i_c \right)^2$
  Gradient descent:
    initialize $w$ to random
    choose $\varepsilon$, $\alpha$
    while $\alpha \|\nabla J(w)\| > \varepsilon$
      $w = w - \alpha \nabla J(w)$
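The gradient descent loop with the stopping test $\alpha \|\nabla J(w)\| > \varepsilon$ can be sketched on a toy objective. The quadratic $J(w) = (w_1 - 3)^2 + (w_2 + 1)^2$, the starting point, and the iteration cap `max_iter` are assumptions for illustration; for an MLP the gradient would come from backpropagation instead.

```python
import math

def grad_J(w):
    """Gradient of the toy objective J(w) = (w1 - 3)^2 + (w2 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(w, alpha=0.1, eps=1e-8, max_iter=10000):
    """Repeat w = w - alpha * grad until alpha * ||grad|| <= eps."""
    for _ in range(max_iter):            # max_iter is an added safety cap
        g = grad_J(w)
        if alpha * math.sqrt(sum(gi * gi for gi in g)) <= eps:
            break
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w

w_min = gradient_descent([0.0, 0.0])
print(w_min)   # close to the minimizer [3, -1]
```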

Training MLP: Single Sample
  For simplicity, first consider the error for one example $x^i$:
  $J_i(w) = \frac{1}{2} \| y^i - f(x^i) \|^2 = \frac{1}{2} \sum_{c=1}^{m} \left( f_c(x^i) - y^i_c \right)^2$
  $f_c(x^i)$ depends on $w$; $y^i$ is independent of $w$.
  Compute partial derivatives w.r.t. $w^k_{pj}$ for all $k$, $p$, $j$.
  Suppose the network has $t$ layers: $f_c(x^i) = z^t_c = h(a^t_c) = h(z^{t-1} \cdot w^t_c)$

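Training MLP: Single Sample
  For the derivation, we use:
  $J_i(w) = \frac{1}{2} \sum_{c=1}^{m} \left( f_c(x^i) - y^i_c \right)^2$ and $f_c(x^i) = h(a^t_c) = h(z^{t-1} \cdot w^t_c)$
  For weights $w^t_{pj}$ to the output layer $t$:
  $\frac{\partial}{\partial w^t_{pj}} J_i(w) = \left( f_j(x^i) - y^i_j \right) \frac{\partial}{\partial w^t_{pj}} \left( f_j(x^i) - y^i_j \right)$
  $\frac{\partial}{\partial w^t_{pj}} \left( f_j(x^i) - y^i_j \right) = h'(a^t_j) \, z^{t-1}_p$
  Therefore, $\frac{\partial}{\partial w^t_{pj}} J_i(w) = \left( f_j(x^i) - y^i_j \right) h'(a^t_j) \, z^{t-1}_p$
  Both $h'(a^t_j)$ and $z^{t-1}_p$ depend on $x^i$. For simpler notation, we don't make this dependence explicit.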

Training MLP: Single Sample
  For a layer $k$, compute partial derivatives w.r.t. $w^k_{pj}$.
  This gets complex, since there are lots of function compositions, so we give the rest of the derivatives without derivation.
  First define $e^k_j$, the error attributed to unit $j$ in layer $k$.
  For layer $t$ (output): $e^t_j = f_j(x^i) - y^i_j$
  For layers $k < t$: $e^k_j = \sum_{c=1}^{N(k+1)} e^{k+1}_c \, h'(a^{k+1}_c) \, w^{k+1}_{jc}$
  Thus for $2 \le k \le t$: $\frac{\partial}{\partial w^k_{pj}} J_i(w) = e^k_j \, h'(a^k_j) \, z^{k-1}_p$
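These recursions can be sketched as a small routine that returns the partial derivatives of $J_i$ for one example. The sigmoid activation, the nested-list weight layout (`W[k][j]` holds $[w_{0j}, w_{1j}, \ldots]$, with `W[0]` feeding the first hidden layer), and the example network are all assumptions for illustration. A finite-difference comparison of `loss` against `grads` is a useful sanity check for code like this.

```python
import math

def h(a):
    return 1.0 / (1.0 + math.exp(-a))       # sigmoid, an assumed choice of h

def h_prime(a):
    return h(a) * (1.0 - h(a))

def forward(x, W):
    """Return unit outputs zs (bias included) and net activations per layer."""
    zs, acts = [[1.0] + list(x)], []
    for layer in W:
        a = [sum(zp * wp for zp, wp in zip(zs[-1], w_j)) for w_j in layer]
        acts.append(a)
        zs.append([1.0] + [h(aj) for aj in a])   # bias unit z_0 = 1 prepended
    return zs, acts

def loss(x, y, W):
    """Squared error J_i for one example."""
    f = forward(x, W)[0][-1][1:]
    return 0.5 * sum((fc - yc) ** 2 for fc, yc in zip(f, y))

def backprop(x, y, W):
    """grads[k][j][p] is the partial derivative of J_i w.r.t. W[k][j][p]."""
    zs, acts = forward(x, W)
    e = [fc - yc for fc, yc in zip(zs[-1][1:], y)]            # output errors
    grads = [None] * len(W)
    for k in range(len(W) - 1, -1, -1):                       # output layer first
        d = [ej * h_prime(aj) for ej, aj in zip(e, acts[k])]  # e_j * h'(a_j)
        grads[k] = [[dj * zp for zp in zs[k]] for dj in d]
        # errors of the layer below; index p + 1 skips the bias weight
        e = [sum(dc * W[k][c][p + 1] for c, dc in enumerate(d))
             for p in range(len(zs[k]) - 1)]
    return grads

# Hypothetical 2-2-1 network and one training example.
W = [[[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]], [[0.2, -0.3, 0.5]]]
x, y = [1.0, -1.0], [1.0]
grads = backprop(x, y, W)
```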

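MLP Training: Multiple Samples
  Error on one example $x^i$: $J_i(w) = \frac{1}{2} \sum_{c=1}^{m} \left( f_c(x^i) - y^i_c \right)^2$, with $\frac{\partial}{\partial w^k_{pj}} J_i(w) = e^k_j \, h'(a^k_j) \, z^{k-1}_p$
  Error on all examples: $J(w) = \frac{1}{2} \sum_{i=1}^{n} \sum_{c=1}^{m} \left( f_c(x^i) - y^i_c \right)^2$, with $\frac{\partial}{\partial w^k_{pj}} J(w) = \sum_{i=1}^{n} e^k_j \, h'(a^k_j) \, z^{k-1}_p$
  In the sum over examples, $e^k_j$, $a^k_j$ and $z^{k-1}_p$ are the values computed for example $x^i$.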

Training Protocols
  Batch Protocol
    True gradient descent.
    Weights are updated only after all examples are processed.
    Might be very slow to train.
  Single Sample Protocol
    Examples are chosen randomly from the training set.
    Weights are updated after every example.
    Weights get changed faster than in batch, but training is less stable.
  Mini Batch
    Update weights after processing a batch of examples.
    Middle ground between the single sample and batch protocols.
    Helps to prevent over-fitting in practice (think of it as a noisy gradient).
    Allows the CPU/GPU memory hierarchy to be exploited, so that it trains much faster than single-sample in terms of wall-clock time.
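One way to organize the three protocols in code is to split the example indices into batches and update the weights once per batch. In this sketch, `batch_size = n` corresponds to the batch protocol and `batch_size = 1` to a single-sample variant that visits each example once per pass (the slides instead draw a random index each time). The helper name and the fixed seed are assumptions.

```python
import random

def mini_batches(n, batch_size, rng):
    """Split example indices 0..n-1 into shuffled batches."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[s:s + batch_size] for s in range(0, n, batch_size)]

rng = random.Random(0)                 # fixed seed, only for reproducibility
batches = mini_batches(10, 4, rng)     # three batches of sizes 4, 4, 2
for batch in batches:
    print(batch)                       # one weight update would follow each batch
```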

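MLP Training: Single Sample
  initialize $w$ to small random numbers
  choose $\varepsilon$, $\alpha$
  while $\alpha \|\nabla J(w)\| > \varepsilon$
    for $i = 1$ to $n$
      $r$ = random index from $\{1, 2, \ldots, n\}$
      $\mathrm{delta}_{pjk} = 0 \quad \forall p, j, k$
      $e^t_j = f_j(x^r) - y^r_j \quad \forall j$
      for $k = t$ to $2$
        $\mathrm{delta}_{pjk} = \mathrm{delta}_{pjk} - \alpha \, e^k_j \, h'(a^k_j) \, z^{k-1}_p$
        $e^{k-1}_j = \sum_{c=1}^{N(k)} e^k_c \, h'(a^k_c) \, w^k_{jc} \quad \forall j$
      $w^k_{pj} = w^k_{pj} + \mathrm{delta}_{pjk} \quad \forall p, j, k$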

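MLP Training: Batch
  initialize $w$ to small random numbers
  choose $\varepsilon$, $\alpha$
  while $\alpha \|\nabla J(w)\| > \varepsilon$
    $\mathrm{delta}_{pjk} = 0 \quad \forall p, j, k$
    for $i = 1$ to $n$
      $e^t_j = f_j(x^i) - y^i_j \quad \forall j$
      for $k = t$ to $2$
        $\mathrm{delta}_{pjk} = \mathrm{delta}_{pjk} - \alpha \, e^k_j \, h'(a^k_j) \, z^{k-1}_p$
        $e^{k-1}_j = \sum_{c=1}^{N(k)} e^k_c \, h'(a^k_c) \, w^k_{jc} \quad \forall j$
    $w^k_{pj} = w^k_{pj} + \mathrm{delta}_{pjk} \quad \forall p, j, k$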

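BackPropagation of Errors
  In MLP terminology, training is called backpropagation: errors are computed (propagated) backwards from the output layer to the input layer.
  while $\alpha \|\nabla J(w)\| > \varepsilon$
    for $i = 1$ to $n$
      $r$ = random index from $\{1, 2, \ldots, n\}$
      $\mathrm{delta}_{pjk} = 0 \quad \forall p, j, k$
      $e^t_j = f_j(x^r) - y^r_j \quad \forall j$   (first, the last layer errors are computed)
      for $k = t$ to $2$
        $\mathrm{delta}_{pjk} = \mathrm{delta}_{pjk} - \alpha \, e^k_j \, h'(a^k_j) \, z^{k-1}_p$
        $e^{k-1}_j = \sum_{c=1}^{N(k)} e^k_c \, h'(a^k_c) \, w^k_{jc} \quad \forall j$   (then errors are computed backwards)
      $w^k_{pj} = w^k_{pj} + \mathrm{delta}_{pjk} \quad \forall p, j, k$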

MLP Training
  Important: weights should be initialized to random nonzero numbers.
  $\frac{\partial}{\partial w^k_{pj}} J_i(w) = e^k_j \, h'(a^k_j) \, z^{k-1}_p$
  $e^k_j = \sum_{c=1}^{N(k+1)} e^{k+1}_c \, h'(a^{k+1}_c) \, w^{k+1}_{jc}$
  If all weights $w^k_{jc} = 0$, the errors $e^k_j$ are zero for layers $k < t$, so the weights in layers $k < t$ will not be updated.

Practical Tips for BP: Weight Decay
  To avoid overfitting, it is recommended to keep the weights small.
  Implement weight decay after each weight update: $w_{new} = w_{new} (1 - \beta)$, $0 < \beta < 1$.
  An additional benefit is that unused weights grow small and may be eliminated altogether.
    A weight is unused if it is left almost unchanged by the backpropagation algorithm.
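A minimal sketch of a gradient step followed by the decay $w \leftarrow w (1 - \beta)$. The function name and the numeric values are hypothetical; the zero gradient in the usage line is chosen only to isolate the effect of the decay.

```python
def sgd_step_with_decay(w, grad, alpha, beta):
    """One gradient step, then weight decay w <- w * (1 - beta)."""
    w = [wi - alpha * gi for wi, gi in zip(w, grad)]
    return [wi * (1.0 - beta) for wi in w]

# With a zero gradient, only the decay acts: every weight shrinks by 1%.
w = sgd_step_with_decay([1.0, -2.0], grad=[0.0, 0.0], alpha=0.1, beta=0.01)
print(w)
```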

Practical Tips for BP: Momentum
  Gradient descent finds only a local minimum.
  Momentum: a popular method to avoid local minima and to speed up descent in flat (plateau) regions.
  Add the temporal average of the direction in which the weights have been moving recently.
  Previous direction: $\Delta w^t = w^t - w^{t-1}$
  Weight update rule with momentum:
  $$w^{t+1} = w^t + (1 - \beta) \left( -\alpha \frac{\partial J}{\partial w^t} \right) + \beta \, \Delta w^t$$
  Here $-\alpha \, \partial J / \partial w^t$ is the steepest descent direction and $\Delta w^t$ is the previous direction.
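The momentum rule can be sketched as a function that returns both the new weights and the step just taken, so the step can be fed back as the "previous direction" on the next call. The toy objective $J(w) = w^2$ and the parameter values are assumptions for illustration.

```python
def momentum_step(w, prev_dw, grad, alpha, beta):
    """Blend the steepest descent step with the previous step."""
    dw = [(1.0 - beta) * (-alpha * g) + beta * d
          for g, d in zip(grad, prev_dw)]
    return [wi + di for wi, di in zip(w, dw)], dw

# Toy objective J(w) = w^2 with gradient 2w (an assumed example).
w, dw = [5.0], [0.0]
for _ in range(200):
    w, dw = momentum_step(w, dw, [2.0 * w[0]], alpha=0.1, beta=0.5)
print(w)   # approaches the minimizer 0
```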

Practical Tips for BP: Normalization
  Features should be normalized for faster convergence.
  Suppose we measure fish length in meters and weight in grams.
    Typical sample: [length = 0.5, weight = 3000].
    The feature length will be almost ignored.
    If length is in fact important, learning will be very slow.
  Any normalization we looked at before (lecture on kNN) will do.
  Test samples should be normalized exactly as the training samples.
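As one possible normalization (zero mean and unit variance per feature, an assumed choice since the slide allows any scheme), the sketch below fits the statistics on the training samples only and then applies the same statistics to test samples. The fish measurements are hypothetical.

```python
import math

def fit_standardizer(X):
    """Per-feature mean and standard deviation of the training samples."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    std = [math.sqrt(sum((row[j] - mean[j]) ** 2 for row in X) / n) or 1.0
           for j in range(d)]            # 'or 1.0' guards constant features
    return mean, std

def standardize(X, mean, std):
    return [[(v - m) / s for v, m, s in zip(row, mean, std)] for row in X]

# Hypothetical fish samples: [length in meters, weight in grams].
train = [[0.5, 3000.0], [0.3, 1000.0], [0.7, 2000.0]]
mean, std = fit_standardizer(train)
train_n = standardize(train, mean, std)
test_n = standardize([[0.5, 2000.0]], mean, std)   # same statistics as training
print(train_n, test_n)
```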

Practical Tips: Learning Rate
  As with any gradient descent algorithm, backpropagation depends on the learning rate $\alpha$.
  Rule of thumb: $\alpha = 0.1$.
  However, $\alpha$ can be adjusted at training time.
  The objective function $J(w)$ should decrease during gradient descent.
    If $J(w)$ oscillates, $\alpha$ is too large: decrease it.
    If $J(w)$ goes down but very slowly, $\alpha$ is too small: increase it.

MLP Training: How long to Train?
  [Figure: decision regions at increasing training time]
  Large training error: random decision regions in the beginning (underfit).
  Small training error: decision regions improve with time.
  Zero training error: decision regions fit the training data perfectly (overfit).
  Can learn when to stop training through validation.
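Stopping through validation is often implemented as "early stopping": track the error on a held-out validation set after each epoch and keep the weights from the best epoch. The sketch below only selects that epoch from a list of validation errors; the `patience` rule and the error values are assumptions, not something the slide specifies.

```python
def early_stopping_epoch(val_errors, patience):
    """Return the epoch with the lowest validation error seen before
    `patience` consecutive epochs pass without improvement."""
    best, best_epoch = float("inf"), -1
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                        # stop training here
    return best_epoch

# Hypothetical validation errors per epoch.
print(early_stopping_epoch([0.9, 0.5, 0.4, 0.45, 0.5, 0.3], patience=2))
```

With these values training stops at epoch 4 and the weights from epoch 2 are kept; the later improvement at epoch 5 is never seen, which is the usual trade-off of a small patience.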

MLP as Non-Linear Feature Mapping
  [Figure: network with inputs $x_1, x_2$]
  An MLP can be interpreted as first mapping the input features to new features,
  then applying a Perceptron (linear classifier) to the new features.

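MLP as Non-Linear Feature Mapping
  [Figure: network with inputs $x_1, x_2$ and new features $y_1, y_2, y_3$; one part of the network is highlighted]
  This part implements the Perceptron (linear classifier).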

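MLP as Non-Linear Feature Mapping
  [Figure: network with inputs $x_1, x_2$ and new features $y_1, y_2, y_3$; one part of the network is highlighted]
  This part implements the mapping to the new features $y$.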

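MLP as Nonlinear Feature Mapping
  Consider the 3 layer NN example we saw previously.
  [Figure: the network, with the samples plotted in the original feature space $(x_1, x_2)$ and in the new feature space $(y_1, y_2)$]
  The samples are not linearly separable in the original feature space.
  They are linearly separable in the new feature space.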

Shallow vs. Deep Architecture
  How many layers should we choose?
  [Figure: a shallow network and a deep network]
  Deep networks have led to many successful applications recently.

Why Deep Networks
  2 layer networks can represent any function.
  But deep architectures are more efficient for representing some classes of functions.
    Problems which can be represented with a polynomial number of nodes with $k$ layers may require an exponential number of nodes with $k-1$ layers.
    Thus with a deep architecture, fewer units might be needed overall.
    Fewer weights, fewer parameter updates.
    Maybe especially in image processing, with structure being mainly local.
  Sub-features created in a deep architecture can potentially be shared between multiple tasks.

Why Deep Networks: Hierarchical Feature Extraction
  Deep architecture works well for hierarchical feature extraction.
    Hierarchies are natural, especially in vision.
  Each stage is a trainable feature transform.
  The level of abstraction increases with the level:
    Input layer: pixels.
    First layer: edges.
    Second layer: object parts (combinations of edges).
    Third layer: objects (combinations of object parts).

Why Deep Networks: Hierarchical Feature Extraction
  Another example (from M. Zeiler, 2013).
  Visualization of learned features, and patches that result in a high response.
  [Figure: Layer 1 and Layer 2]

Why Deep Networks: Hierarchical Feature Extraction
  Visualization of learned features, and patches that result in a high response.
  [Figure: Layer 3 and Layer 4]

Early Work
  Fukushima (1980): Neo-Cognitron.
  LeCun (1998): Convolutional Neural Networks.
    Similarities to the Neo-Cognitron.
  Many-layered networks trained with backpropagation:
    Tried early, but without much success.
    Very slow.
    Diffusion of gradient.
    Recent work has shown significant training improvements with various tricks (drop-out, unsupervised learning of early layers, etc.).

Prior Knowledge for Network Architecture
  We can put our prior knowledge about the task into the network by designing appropriate:
    connectivity structure
    weight constraints
    neuron activation functions
  This is less intrusive than hand-designing the features, but it still prejudices the network towards the particular way of solving the problem that we had in mind.

Convolutional Nets
  Neural networks with a special type of architecture that is particularly good for image processing.
  [Figure: input image -> feature maps (C) -> subsampled feature maps (S) -> feature maps (C) -> subsampled feature maps (S) -> final layer]
  Convolution layer (C): feature extraction.
  Subsampling layer (S): shift and distortion invariance.

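Feature extraction or Convolution layer
  Use convolution (and thresholding) to detect the same feature at different image positions.
  Convolve the input image with the mask
  $$\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix}$$
  and threshold the result.
  Recall how convolution works: each neuron applies the same mask weights $w_{11}, \ldots, w_{33}$ to a different 3x3 window of the input image.
  [Figure: input image -> convolved image -> thresholded image = feature map]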
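A feature map can be sketched in a few lines of plain Python. This version slides the mask over the image without flipping it (cross-correlation, which is what convolutional networks typically compute) and thresholds at zero; both choices, as well as the edge-detecting mask and the tiny image, are assumptions for illustration.

```python
def feature_map(image, mask, bias=0.0):
    """Slide a k x k mask over the image, then threshold at zero."""
    H, W, k = len(image), len(image[0]), len(mask)
    out = []
    for r in range(H - k + 1):
        row = []
        for c in range(W - k + 1):
            s = bias + sum(mask[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k))
            row.append(max(0.0, s))      # thresholding (assumed ReLU-style)
        out.append(row)
    return out

# Hypothetical 3x5 image with a dark-to-bright vertical edge.
image = [[0, 0, 1, 1, 1]] * 3
mask = [[-1, 0, 1]] * 3                  # responds to dark-to-bright edges
print(feature_map(image, mask))
```

The same nine mask weights are reused at every window position, which is the weight sharing that makes the detector respond to the same feature anywhere in the image.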

Feature extraction
  Convolution masks are learned: they are tunable parameters of the network.
  Shared weights: all neurons in a feature map share the same weights (but not the biases).
  Thus all neurons detect the same feature at different positions in the input image.
    If a feature is useful in one image location, it should be useful in all other locations.
    Weight sharing also greatly reduces the number of tunable parameters.
  [Figure: the red connections all have the same weight]

Weight Sharing Constraints
  It is easy to modify the backpropagation algorithm to incorporate weight sharing.
  We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints.
    So if the weights started off satisfying the constraints, they will continue to satisfy them.
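One common way to make gradients satisfy a tying constraint $w_a = w_b$ is to replace each tied gradient with the sum of the gradients in its group, so that all tied weights receive the same update. The slide does not spell out the modification, so the summing rule, the function name, and the numbers below are assumptions.

```python
def tie_gradients(grads, groups):
    """Give every weight in a tied group the sum of the group's gradients."""
    tied = list(grads)
    for group in groups:
        total = sum(grads[i] for i in group)
        for i in group:
            tied[i] = total
    return tied

# Hypothetical gradients for four weights; weights 0 and 2 are tied.
print(tie_gradients([0.25, 0.5, -0.75, 1.0], groups=[[0, 2]]))
```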

Feature extraction or Convolution layer
  Use several different feature types, each with its own mask and its own resulting map of replicated detectors.
  This allows each patch of the image to be represented in several ways.
    feature map 1 (convolve with mask 1)
    feature map 2 (convolve with mask 2)
    feature map 3 (convolve with mask 3)
    feature map 4 (convolve with mask 4)
    feature map 5 (convolve with mask 5)

Subsampling Layer
  Subsampling layers reduce the spatial resolution of each feature map, using a weighted sum.
  This achieves a certain degree of shift and distortion invariance.
  Weight sharing is also applied in subsampling layers.
  Subsampling reduces the effect of noise and of shift or distortion.
  [Figure: feature map -> subsampled feature map]

Subsampling Layer
  Subsampling is also called pooling.
  Instead of subsampling, sometimes max pooling works better.
    Replace the weighted sum with a max operation.
  [Figure: feature map -> subsampled feature map]
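Both kinds of pooling can be sketched with one helper that applies an operation to each non-overlapping window. The 2x2 window, the plain (unweighted) average standing in for the weighted sum, and the example feature map are assumptions for illustration.

```python
def pool(fmap, size=2, op=max):
    """Apply op to each non-overlapping size x size window of fmap."""
    return [[op([fmap[r + i][c + j]
                 for i in range(size) for j in range(size)])
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

def mean(values):
    return sum(values) / len(values)

# Hypothetical 4x4 feature map.
fmap = [[1, 2, 0, 0],
        [3, 4, 0, 1],
        [0, 0, 5, 6],
        [0, 0, 7, 8]]
print(pool(fmap, op=max))    # max pooling
print(pool(fmap, op=mean))   # average pooling
```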

Subsampling Layer
  Subsampling achieves a certain degree of shift and distortion invariance.

Subsampling Layer

Convolutional Nets

Problem with Subsampling (Pooling)
  Averaging (or taking a max) of four pixels gets a small amount of shift invariance.
  Problem to be aware of:
    After several levels of pooling, we have lost information about the precise positions of things.
    This makes it impossible to use the precise spatial relationships between high-level parts for recognition.

LeNet
  Yann LeCun and his collaborators developed a good recognizer for handwritten digits by using backpropagation in a convolutional net.
  LeNet uses knowledge about the invariances to design:
    local connectivity
    weight-sharing
    pooling
  This net was used for reading ~10% of the checks in North America.
  Look at the impressive demos of LeNet at http://yann.lecun.com

Conv Nets: Character Recognition http://yann.lecun.com/exdb/lenet/index.html

ConvNet for ImageNet
  Krizhevsky et al. (NIPS 2012) developed a deep convolutional neural net of the type pioneered by Yann LeCun.
  Architecture:
    7 hidden layers, not counting some max pooling layers.
    The early layers were convolutional.
    The last two layers were globally connected.
  Activation function: rectified linear units in every hidden layer.
    These train much faster and are more expressive than logistic units.

ConvNet on Image Classification

Tricks to Improve Generalization
  To get more data:
    Use left-right reflections of the images.
    Train on random 224x224 patches from the 256x256 images.
  At test time, combine the opinions from ten different patches:
    four 224x224 corner patches plus the central 224x224 patch;
    the reflections of those five patches.
  Use dropout to regularize the weights in the fully connected layers:
    Half of the hidden units in a layer are randomly removed for each training example.
    This stops hidden units from relying too much on other hidden units.
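Two of these tricks reduce to simple bookkeeping, sketched here under assumed conventions: a patch is described by its (top, left) offset and a reflection flag, and dropout is applied by zeroing each hidden unit output with probability 0.5. The function names and the fixed seed are hypothetical.

```python
import random

def ten_crop_specs(image_size=256, patch=224):
    """(top, left, reflected) for the four corners and the center, plus reflections."""
    m = image_size - patch               # 32 pixels of slack for 256 vs 224
    offsets = [(0, 0), (0, m), (m, 0), (m, m), (m // 2, m // 2)]
    return [(top, left, flip) for flip in (False, True)
            for (top, left) in offsets]

def dropout(z, rng, p_drop=0.5):
    """Zero each hidden unit output with probability p_drop (training only)."""
    return [0.0 if rng.random() < p_drop else zj for zj in z]

print(ten_crop_specs())
rng = random.Random(0)                   # fixed seed, only for reproducibility
print(dropout([1.0, 2.0, 3.0, 4.0], rng))
```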

Training Deep Networks
  Difficulties of supervised training of deep networks:
  Early layers of a multilayer network do not get trained well.
    Diffusion of gradient: the error attenuates as it propagates to earlier layers.
    This is exacerbated since the top couple of layers can usually learn any task "pretty well", and thus the error to earlier layers drops quickly as the top layers "mostly" solve the task.
    Lower layers never get the opportunity to use their capacity to improve results; they just do a random feature map.
    Need a way for early layers to do effective work.
  Often there is not enough labeled data available, while there may be lots of unlabeled data.
    Can we use unsupervised/semi-supervised approaches to take advantage of the unlabeled data?
  Deep networks tend to have more local minima problems than shallow networks during supervised training.

Greedy Layer-Wise Training
  Greedy layer-wise training to ensure that lower layers learn:
  1. Train the first layer using your data without the labels (unsupervised).
     We do not know the targets at this level anyway.
     Can use the more abundant unlabeled data which is not part of the training set.
  2. Freeze the first layer parameters and start training the second layer, using the output of the first layer as the unsupervised input to the second layer.
  3. Repeat this for as many layers as desired.
     This builds our set of robust features.
  4. Use the outputs of the final layer as inputs to a supervised layer/model and train the last supervised layer(s), leaving the early weights frozen.
  5. Unfreeze all weights and fine tune the full network by training with a supervised approach, starting from the pre-trained weight settings.

Greedy Layer-Wise Training
  Greedy layer-wise training avoids many of the problems of trying to train a deep net in a supervised fashion.
    Each layer gets full learning focus in its turn, since it is the only current "top" layer.
    Can take advantage of the unlabeled data.
    When you finally tune the entire network with supervised training, the network weights have already been adjusted so that you are in a good error basin and just need fine tuning.
  This helps with the problems of:
    ineffective early layer learning;
    deep network local minima.

Unsupervised Learning: Auto-Encoders
  A type of unsupervised learning which tries to discover generic features of the data.
  Input sample = output.
  Learn the identity function by learning important sub-features, not by just passing the data through.
  [Figure: auto-encoder network with a bottleneck layer]

Concluding Remarks
  Advantages
    MLP can learn complex mappings from inputs to outputs, based only on the training samples.
    Easy to incorporate a lot of heuristics.
    Many competitions won recently.
  Disadvantages
    May be difficult to analyze and predict its behavior.
    May take a long time to train.
    May get trapped in a bad local minimum.
    A lot of tricks are needed for successful implementation.