SGD and Deep Learning

Subgradients. Let's make the gradient "cheating" more formal. Recall that the gradient gives the slope of the tangent at $w_1$: $f(w_1) + \nabla f(w_1)^\top (w - w_1)$. What about the non-differentiable case?

Subgradients. There may be multiple lines $f(w_1) + v_1^\top (w - w_1)$, $f(w_1) + v_2^\top (w - w_1)$ that touch the curve at $w_1$ and lie below it. The slope of each of these is a subgradient. The set of all such slopes is called the subdifferential.

Subgradients. Formally, $v$ is a subgradient at $w_1$ if $f(w_1) + v^\top (w - w_1) \le f(w)$ for all $w$. For $f(w) = |w|$ the subdifferential at $w = 0$ is $[-1, 1]$. Let $f_1(w), \dots, f_r(w)$ be $r$ convex functions and define $g(w) = \max_i f_i(w)$. Then $g$ is convex; its subdifferential is characterized on the next slide.

Subgradients of piecewise convex functions. Let $f_1(w), \dots, f_r(w)$ be $r$ differentiable convex functions and define $g(w) = \max_i f_i(w)$. Then $g$ is convex. For a given $w$ denote $J(w) = \arg\max_i f_i(w)$. The subdifferential is the set of vectors $v = \sum_{i \in J(w)} \alpha_i \nabla f_i(w)$ with $\alpha_i \ge 0$ and $\sum_{i \in J(w)} \alpha_i = 1$. Our SVM "hack" is actually stochastic subgradient descent!
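As a concrete illustration of the SVM connection, here is a minimal numpy sketch (the function name `hinge_subgradient` is ours, not from the lecture) that returns one valid subgradient of the hinge loss $\max(0, 1 - y\, w^\top x)$; at the kink, any element of the subdifferential is a valid choice.

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """One valid subgradient of max(0, 1 - y * (w @ x)) with respect to w."""
    margin = y * np.dot(w, x)
    if margin < 1.0:
        return -y * x            # active linear piece: gradient of 1 - y * (w @ x)
    return np.zeros_like(w)      # inactive piece (and the kink): 0 is a valid choice
```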

Stochastic Gradient Descent (SGD). Consider the objective: $\min_w \frac{1}{m} \sum_{i=1}^{m} f_i(w)$. SGD update: $w_{t+1} = w_t - \eta_t \nabla_w f_i(w_t)$ for a randomly chosen index $i$. Compare to GD: $w_{t+1} = w_t - \frac{\eta_t}{m} \sum_i \nabla_w f_i(w_t)$. We are using a noisy (random) gradient. The randomness comes from the index random variable $I$. Expected value: $\mathbb{E}[\nabla_w f_I(w)] = \sum_i P[I = i] \nabla_w f_i(w) = \frac{1}{m} \sum_i \nabla_w f_i(w) = \nabla_w f(w)$.
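A minimal sketch of the two updates above on a toy least-squares objective (the problem, step size, and function names are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: f(w) = (1/m) * sum_i f_i(w) with f_i(w) = 0.5 * (x_i @ w - y_i)**2.
m, d = 100, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d)

def grad_fi(w, i):
    """Gradient of a single term f_i at w."""
    return (X[i] @ w - y[i]) * X[i]

def sgd(steps=1000, eta=0.01):
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(m)              # random index I, uniform over the m terms
        w = w - eta * grad_fi(w, i)      # noisy gradient step
    return w

def gd(steps=1000, eta=0.01):
    w = np.zeros(d)
    for _ in range(steps):
        full_grad = np.mean([grad_fi(w, i) for i in range(m)], axis=0)
        w = w - eta * full_grad          # exact gradient step
    return w
```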

SGD as Noisy Gradient. Alternative interpretation of SGD: think of the iterates as random variables. Iterate: given $W_t$, sample a $V_t$ with the property $\mathbb{E}[V_t \mid W_t] = \nabla f(W_t)$. Update: $W_{t+1} = W_t - \eta_t V_t$.

Convergence of SGD. Assume the $f_i$ are convex. The algorithm is random, so each run gives a different result. For example, if we happen to sample the same $V_i$ at all iterations, that's bad. We can analyze expected performance, or prove the performance is good with high probability. Here we'll show expected performance.

Convergence of SGD. Assume the $f_i$ are convex. Let $w^*$ denote the optimum and assume $\|w^*\|_2 \le B$. Assume $\|V_t\|_2 \le G$. Step size: $\eta = \frac{B}{G\sqrt{T}}$. Let $\bar{W} = \frac{1}{T} \sum_{i=1}^{T} W_i$. Theorem: $\mathbb{E}[f(\bar{W})] - f(w^*) \le \frac{BG}{\sqrt{T}}$. See lecture notes.
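The theorem's setup can be written as a short sketch. This is a simplification: the standard analysis also projects the iterates onto the ball of radius $B$, which is omitted here, and `subgrad` is an assumed user-supplied stochastic subgradient oracle.

```python
import numpy as np

def averaged_sgd(subgrad, d, B, G, T, rng):
    """Run T SGD steps with eta = B / (G * sqrt(T)) and return the averaged iterate."""
    eta = B / (G * np.sqrt(T))
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        v = subgrad(w, rng)   # stochastic subgradient: E[v | w] in the subdifferential, ||v|| <= G
        w = w - eta * v
        w_sum += w
    return w_sum / T          # W_bar = (1/T) * sum_t W_t
```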

Non-Linear Classification. All our classifiers thus far have been linear, possibly in a complex feature space, but linear. A non-linear classifier would be something like $y = \mathrm{sign}[f(x; w)]$, where $f$ is a non-linear function of $x$ and $w$. Problem: the optimization is almost always non-convex. Solution: close your eyes and hope for the best.

Neural Networks. Which non-linear models are of interest? Since we want to model things humans do, why not try neural network models? (Image credit: CLARITY process, K. Deisseroth, Stanford.)

Neural Networks. Caveat: real neural networks are not sufficiently understood. We don't know the structure of connections between neurons (this is what the field of connectomics tries to map), and we don't sufficiently understand the physiology of each neuron. Thus, what we call neural networks are highly simplified abstractions of real networks.

The Winter of Neural Nets. In the 1980s neural network models were proposed and attracted much interest, with some empirical success. In the 1990s interest turned to SVMs, which use convex optimization and were empirically successful.

Neural Nets Revisited - Deep Learning. Since 2012 neural nets have re-emerged and shown striking success on tasks such as: image understanding, speech recognition, translation (e.g., שטוף את הכלים → "Wash the dishes"), and game playing.

Neural Networks. Some properties of biological neural networks: neurons sum their inputs (via dendrites); if the sum is large enough, the neuron fires a spike or sequence of spikes that are transmitted along the axon. Visual input undergoes a sequence of processing steps (first in the retina, then in V1, V2, etc.). Caution: this is highly simplified.

Artificial Neural Networks. Construct models that have: summation of inputs followed by a non-linearity, and a sequence of layers; the last layer is often linear. $z_k = h(W_k z_{k-1} + b_k)$, with $z_0 = x$. (Diagram: $x = z_0 \to z_1 \to z_2 \to \dots \to z_L$.)
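A minimal sketch of this layer recursion (treating the last layer as linear, per "the last layer is often linear"; the function name and signature are ours):

```python
import numpy as np

def forward(x, weights, biases, h):
    """Compute z_0 = x and z_k = h(W_k z_{k-1} + b_k), with a linear last layer."""
    z = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        v = W @ z + b
        z = v if k == len(weights) - 1 else h(v)   # last layer: no non-linearity
    return z
```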

Classification with Deep Learning. Use the network to construct a binary classifier: $y = \mathrm{sign}[z_L]$. Or multi-class: $y = \arg\max_{\bar{y}} z_{L,\bar{y}}$, where the last layer has one output $z_{L,1}, z_{L,2}, z_{L,3}, \dots$ per class.

Which non-linearity? The non-linearity is also called the activation function: $z_k = h(W_k z_{k-1} + b_k)$, $z_0 = x$. Common choices: rectified linear unit (ReLU) $h(z) = \max[0, z]$, threshold, and sigmoid $h(z) = \frac{1}{1 + e^{-z}}$.
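The three activation functions written out directly (a small sketch; taking the threshold to map into {0, 1} is our assumption, chosen to match the logical-function example on the next slide):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # h(z) = max(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # h(z) = 1 / (1 + e^(-z))

def threshold(z):
    return (z > 0).astype(float)       # step function: 1 if z > 0, else 0
```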

Expressive Power of Neural Nets. Focus on $h(z)$ being the threshold for simplicity. Such networks can express any logical function. For example: $(z_{t,1} \vee \dots \vee z_{t,n}) = h(\sum_i z_{t,i} - 0.5)$. One can show that any function computed by a Turing machine running for $T$ steps can be implemented with a neural net with $O(T^2)$ neurons and edges. See [Shalev-Shwartz & Ben-David].
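For instance, the OR construction above as a single threshold unit (a sketch assuming binary inputs in {0, 1} and the threshold activation defined earlier):

```python
import numpy as np

def threshold(z):
    return (z > 0).astype(float)

def logical_or(z):
    """(z_1 OR ... OR z_n) = h(sum_i z_i - 0.5) for binary z_i."""
    return threshold(np.sum(z) - 0.5)

assert logical_or(np.array([0.0, 0.0, 0.0])) == 0.0
assert logical_or(np.array([0.0, 1.0, 0.0])) == 1.0
```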

Expressive Power of Neural Nets. Each $z_{t,i} = \mathrm{sign}[w_{t,i} \cdot z_{t-1} + b_{t,i}]$ encodes a hyperplane in $z_{t-1}$ space; for the first layer it encodes a hyperplane in $x$ space. The network can therefore encode complex intersections and unions of half-spaces.

The effect of the number of hidden neurons: 3 hidden neurons, 6 hidden neurons, 20 hidden neurons. More neurons = more capacity. Slide credit: Fei-Fei Li & Andrej Karpathy.

VC Dimension of Neural Nets. Say we have $E$ weights in our network and assume each is encoded with $D$ bits. A bound on the VC dimension of the class of all such networks: $VC \le \log|H| = \log(2^{DE}) = DE$. What if the weights are real? $VC \le cE\log E$. See Shalev-Shwartz and Ben-David for the proof.

Expressive Power of Neural Nets. Given enough units, $z_L$ can implement any function. Consider the scalar-$x$ case with the first layers implementing $z_2 = d_1(h(x - c_1) - h(x - c_2))$, i.e., $z_2(x)$ is a bump of height $d_1$ between $c_1$ and $c_2$. We can implement any function by adding such components.
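A sketch of this bump construction, summing such components to approximate an arbitrary scalar function (the target function, grid, and names here are illustrative choices, not from the lecture):

```python
import numpy as np

def threshold(z):
    return (z > 0).astype(float)

def bump(x, c1, c2, d):
    """d * (h(x - c1) - h(x - c2)): equals d on the interval (c1, c2], 0 elsewhere."""
    return d * (threshold(x - c1) - threshold(x - c2))

# Piecewise-constant approximation of sin(x) on [0, 2*pi] by summing bumps.
xs = np.linspace(0.0, 2.0 * np.pi, 500)
edges = np.linspace(0.0, 2.0 * np.pi, 21)                 # 20 intervals
heights = np.sin((edges[:-1] + edges[1:]) / 2.0)          # target value per interval
approx = sum(bump(xs, c1, c2, d)
             for c1, c2, d in zip(edges[:-1], edges[1:], heights))
```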

What kinds of functions can a Neural Network represent? [Figure: http://neuralnetworksanddeeplearning.com/chap4.html]

Architectural Decisions. In building a network we need to decide: Which pairs of neurons are connected? (e.g., we can have K layers with all neurons in layer t connected to all neurons in layer t+1). What are the activation functions? Do different neurons use the same weights? (parameter tying).

What's in an image? The ImageNet Dataset

Neural Nets for Images. One of the striking successes of deep learning. The ImageNet dataset: 1000 classes, about a million labeled images. In 2012, the paper "ImageNet Classification with Deep Convolutional Neural Networks" by Krizhevsky, Sutskever and Hinton. Previous top-5 error: 25%. Theirs: 17%. Today: 3%! How did they do it?

Architectures for Images. A combination of local features (edges, texture, color) and global ones (shape, relations between objects). Local features are expected to be similar across the image, so use the same weights everywhere: a convolution of the image with a filter!

Architectures for Images. Use many such filters, and reduce resolution by pooling (e.g., max pooling).
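A minimal numpy sketch of both ingredients: a single-filter "valid" convolution (strictly, a cross-correlation, as is common in deep learning) and non-overlapping max pooling. The function names are ours.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation of a single-channel image with one filter."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Reduce resolution by taking the max over non-overlapping size x size blocks."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size          # drop any ragged border
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))
```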

AlexNet. From: "ImageNet Classification with Deep Convolutional Neural Networks". 60 million parameters! Would you expect this to work?

What do the neurons calculate? Slide credit: Fei-Fei Li & Andrej Karpathy and Yann LeCun

Training Neural Nets. SGD is a simple and effective approach to training neural nets. Consider the binary case and minimize the hinge loss: $\min_W \sum_i \max[0, 1 - y_i z_L(W, x_i)] = \sum_i \ell(y_i, z_L(W, x_i))$. We will need to add some regularization on $W$. Other losses can be used. Note this objective is not convex in $W$.
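The training objective as code (a sketch; the l2 penalty is one possible choice of the regularization the slide alludes to, and `lam` is an assumed hyperparameter):

```python
import numpy as np

def hinge_objective(scores, labels, weights, lam=1e-3):
    """sum_i max(0, 1 - y_i * z_L(W, x_i)) plus an l2 penalty on the weight matrices.

    `scores` are the scalar last-layer outputs z_L(W, x_i); labels are in {-1, +1}.
    """
    hinge = np.sum(np.maximum(0.0, 1.0 - labels * scores))
    l2 = sum(np.sum(W ** 2) for W in weights)
    return hinge + lam * l2
```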

Back propagation. To run SGD we just need to calculate the gradient of the loss with respect to $W$. There are software tools that will do this for you (e.g., Theano, TensorFlow, Caffe, MXNet). The gradient calculation for feed-forward nets is called back propagation (Rumelhart, Hinton and Williams, 1986).

Back propagation. Back propagation has three phases. Forward pass: for each layer calculate the linear output $v_t = W_t z_{t-1}$ and the non-linear activation $z_t = h(v_t)$. Error vectors: calculate an error vector for each layer, from last to first: $\delta_L = \ell'(y, z_L(x))$ and $\delta_t = W_{t+1}^\top (\delta_{t+1} \circ h'(v_{t+1}))$. Gradients: set $\frac{\partial \ell}{\partial W_t} = (\delta_t \circ h'(v_t))\, z_{t-1}^\top$.
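A compact sketch of the three phases for a net with $z_t = h(W_t z_{t-1})$ (biases omitted, as on this slide; `loss_grad` is an assumed function returning the derivative of the loss with respect to the last-layer output):

```python
import numpy as np

def backprop(x, y, weights, h, h_prime, loss_grad):
    """Forward pass, backward error vectors, and per-layer weight gradients."""
    # Phase 1: forward pass, storing linear outputs v_t and activations z_t.
    zs, vs = [x], []
    for W in weights:
        v = W @ zs[-1]
        vs.append(v)
        zs.append(h(v))

    # Phases 2-3: error vectors from the last layer to the first, and gradients.
    L = len(weights)
    delta = loss_grad(y, zs[-1])             # delta_L
    grads = [None] * L
    for t in range(L - 1, -1, -1):
        g = delta * h_prime(vs[t])           # delta_t composed with h'(v_t)
        grads[t] = np.outer(g, zs[t])        # gradient w.r.t. W_t: g z_{t-1}^T
        if t > 0:
            delta = weights[t].T @ g         # delta_{t-1}
    return grads
```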

Problems. Non-convex optimization: why should it converge to anything good? Empirically it does. It is conjectured that the loss surface has mostly saddle points and no bad local minima.

Problems. Many parameters, often on the order of the training set size. Shouldn't overfitting be an issue? Possible explanations: different regularization approaches are used (e.g., l2 regularization or Dropout), and we are not doing general ERM but rather optimizing with SGD, which may generalize better (Hardt, Singer, Recht).

Problems. Bottom line: we don't know why they work so well. An exciting research question!

Beyond Classification. Many cognitive tasks are not simple classification: speech recognition, translation, driving. Here the input is a sequence over time and the output is also a sequence.

Sequence to Sequence Mapping. Here's a possible approach to translation, proposed in "Sequence to Sequence Learning with Neural Networks" by Sutskever et al., 2014. State-of-the-art speech recognition and translation.