Lecture 16: Introduction to Neural Networks

CS 5966/6966: Theory of Machine Learning, March 20th, 2017
Instructor: Aditya Bhaskara. Scribe: Philippe David.

Abstract

In this lecture, we consider backpropagation, a standard algorithm used to solve the ERM problem in neural networks. We also discuss the use of Stochastic Gradient Descent (SGD) in backpropagation. The use of SGD in the neural network setting is motivated by the high cost of running backpropagation over the full training set.

1 Recap: The Expressive Power of Neural Networks

In the previous lecture, we started formalizing feed-forward neural networks. These networks take a collection of inputs, pass them through a series of layers composed of neurons, and use this mapping to produce an output. We then defined threshold networks: networks in which each neuron in each layer computes a weighted threshold function.

We then studied the expressive power of these networks, namely, what types of functions can be implemented using a neural network. We looked at two theorems related to this question. The first says that a feed-forward network with a single hidden layer can approximate any function; this is known as the universal approximation theorem. We then noted that, while this may look powerful, there are many functions that require an exponential number of neurons to approximate. We argued this using the second theorem: the VC dimension of the class of all such networks (all networks with m edges and n neurons) is O((m + n) log(m + n)). While we did not prove this theorem, we used it to conclude the previous one. The theorem also tells us that if you are interested in learning neural networks with at most m edges, roughly that many samples suffice: if a class of functions has VC dimension d, then learning the best network in the class requires about d/ε² samples. This is what is known as the Fundamental Theorem of Learning Theory.

Nevertheless, there is a caveat. All this tells you is that if you had enough samples and a procedure that could find the best Empirical Risk Minimizer (ERM) over those examples, you could recover the best network. VC theory tells us that (VC dimension)/ε² examples plus an algorithm for solving ERM implies efficient learning. It turns out, however, that for neural networks this second part is very hard. We showed that the ERM problem for a two-layer neural network with just 3 neurons in each layer is NP-hard, via a reduction from coloring a k-colorable graph with 2k - 1 colors for general k, which is known to be NP-hard. We will prove a simplified version of this theorem in Assignment 3.
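To make the notion of a threshold network concrete, here is a minimal sketch (my own illustration, not code from the lecture) of a feed-forward network in which every neuron computes a weighted threshold of its inputs; the XOR weights are a standard textbook choice and are an assumption on my part.

```python
import numpy as np

def threshold(z):
    # Weighted threshold (step) activation: 1 if the weighted sum is non-negative, else 0.
    return (z >= 0).astype(float)

def feed_forward(x, layers):
    # 'layers' is a list of (W, b) pairs; each neuron computes a weighted
    # threshold of the previous layer's outputs.
    out = x
    for W, b in layers:
        out = threshold(W @ out + b)
    return out

# Example: a single hidden layer computing XOR, a function no single threshold unit can represent.
hidden = (np.array([[1.0, 1.0], [-1.0, -1.0]]), np.array([-0.5, 1.5]))  # OR unit, NAND unit
output = (np.array([[1.0, 1.0]]), np.array([-1.5]))                     # AND of the two hidden units
net = [hidden, output]
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, feed_forward(np.array(x, dtype=float), net))
```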

2 The Backpropagation Algorithm

The backpropagation algorithm is the standard way of training a neural network. The key to training a neural network is finding the right set of weights for all of the connections so that the network makes the right decisions. The algorithm attacks this problem in a manner very similar to gradient descent: it iteratively and progressively works its way toward a (locally) optimal solution. Clearly, there are no guarantees, because the problem is NP-hard, but it is a very good heuristic.

We start with a simple example. Suppose you have a one-layer neural network. What is the training problem? We want to minimize the empirical risk, given by

(1)    \sum_i \mathbf{1}[ f(x^{(i)}) \neq y_i ],

where the x^{(i)} are the inputs and the y_i are the corresponding labels. Since this is just a sum of indicator functions and we want something smooth, this objective is not ideal to optimize directly. There are various proxies for it, and a commonly used one is the squared loss:

(2)    \text{loss} = \sum_i ( f(x^{(i)}) - y_i )^2.

The network here is very simple: it has weights w_1, w_2, \ldots, w_n, and f is a threshold function,

(3)    f(x^{(i)}) = \sigma\left( \sum_j w_j x_j^{(i)} \right),

where i indexes the input points and j indexes the weights. From here, gradient descent appears to be the natural idea for finding the best weights: start with some w^{(0)} and set w^{(t+1)} = w^{(t)} - \gamma \nabla \text{loss}(w^{(t)}).
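As a small illustration (my own, with a sigmoid standing in for the threshold σ and a made-up toy dataset), the 0-1 empirical risk of equation (1) and the squared-loss proxy of equation (2) can be computed as follows:

```python
import numpy as np

def sigmoid(z):
    # Smooth stand-in for the threshold function sigma.
    return 1.0 / (1.0 + np.exp(-z))

def f(w, X):
    # One-layer network: f(x) = sigma(sum_j w_j x_j), applied to each row of X.
    return sigmoid(X @ w)

def empirical_risk_01(w, X, y):
    # Equation (1): fraction of examples where the thresholded output disagrees with y_i.
    preds = (f(w, X) >= 0.5).astype(float)
    return np.mean(preds != y)

def squared_loss(w, X, y):
    # Equation (2): smooth proxy for the 0-1 risk.
    return np.sum((f(w, X) - y) ** 2)

# Toy data (assumed for illustration): n = 4 points in R^2 with binary labels.
X = np.array([[0.0, 1.0], [1.0, 0.5], [-1.0, -0.5], [-0.5, -1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.zeros(2)
print(empirical_risk_01(w, X, y), squared_loss(w, X, y))
```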

The issue is computing this gradient: the gradient is not continuous, since the derivative of a simple threshold function is zero away from the threshold and undefined at it. A commonly used alternative is the rectified linear unit (ReLU) activation, which is zero for negative inputs and grows linearly for positive inputs. Despite its name, this function is non-linear; if we only used linear activation functions in a neural network, the output would just be a linear transformation of the input, which is not enough to form a universal function approximator.

Now that we know what f is, we can run gradient descent. Using the chain rule,

(4)    \frac{\partial \text{loss}}{\partial w_j} = \sum_i 2 \left( f(x^{(i)}) - y_i \right) \sigma'\left( \sum_k w_k x_k^{(i)} \right) x_j^{(i)}.

Now let us explore a stochastic gradient descent (SGD) version of this. What do we mean by this? Remember that a loss which is a sum of per-example terms is something we encounter often. In such a setting, computing the gradient of the full loss is expensive, so we simply take a random example and move along the gradient for that example. This is the common way of training neural networks: you do not look at the loss summed over everything; instead you take a random training example and update according to the gradient for that example. This gives the following algorithm:

For t = 1, \ldots, T: pick a random index i_t \in \{1, \ldots, n\} and set

    w^{(t+1)} = w^{(t)} - \gamma \nabla l_{i_t}(w^{(t)}),

where l(w^{(t)}) = \sum_i l_i(w^{(t)}) = \sum_i ( f(x^{(i)}) - y_i )^2.

In SGD we iteratively update the weight parameters in the direction of the negative gradient of the loss function until we reach a minimum. Unlike traditional gradient descent, we do not use the entire dataset to compute the gradient at each iteration. Instead, at each iteration we randomly select a single data point from the dataset and move along the gradient with respect to that data point.
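A minimal sketch of the SGD loop above (my own illustration; the sigmoid, step size, iteration count, and toy data are assumptions), using the per-example version of the gradient in equation (4):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd(X, y, gamma=0.1, T=1000):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        i = rng.integers(n)          # pick a random index i_t in {1, ..., n}
        pred = sigmoid(X[i] @ w)
        # Per-example gradient of l_i(w) = (f(x^(i)) - y_i)^2, as in equation (4):
        # 2 (f(x^(i)) - y_i) * sigma'(w . x^(i)) * x^(i), with sigma'(z) = sigma(z)(1 - sigma(z)).
        grad = 2.0 * (pred - y[i]) * pred * (1.0 - pred) * X[i]
        w = w - gamma * grad         # w^(t+1) = w^(t) - gamma * grad l_{i_t}(w^(t))
    return w

# Toy data (assumed): linearly separable points with binary labels.
X = np.array([[0.0, 1.0], [1.0, 0.5], [-1.0, -0.5], [-0.5, -1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
print(sgd(X, y))
```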

3 Computing weight vector derivatives in 2-layer neural networks

As we have seen, computing this derivative for a one-layer network is very simple. For a 2-layer network, the computation becomes more complicated. The first question that naturally arises is: what parameters do we want to learn if we want to learn the best neural network of this form? In order to evaluate the loss function we need a threshold for every node, a threshold \tau_i for each hidden node i and a threshold \tau for the top-layer node, where \tau denotes the thresholds; the weights between the nodes are denoted u_j (hidden layer to output) and w_{ij} (inputs to hidden layer). To run SGD, we pick a random data point and move along the gradient.

The output is

(5)    f = \sigma\left( \sum_j u_j y_j \right),

so

(6)    \frac{\partial f}{\partial u_j} = \sigma'\left( \sum_l u_l y_l \right) y_j,

where \sigma is the threshold function. But what is \partial f / \partial y_i?

(7)    \frac{\partial f}{\partial y_i} = \frac{\partial}{\partial y_i}\, \sigma\left( \sum_l u_l y_l \right) = \sigma'\left( \sum_l u_l y_l \right) u_i.

The hidden units are

(8)    y_i = \sigma\left( \sum_l w_{il} x_l \right), \quad \text{where the } x_l \text{ are the inputs},

so by the chain rule

(9)    \frac{\partial f}{\partial w_{ij}} = \sigma'\left( \sum_l u_l y_l \right) u_i \, \sigma'\left( \sum_l w_{il} x_l \right) x_j = \sigma'(f)\, \sigma'(y_i)\, u_i\, x_j,

where the last expression is shorthand, with \sigma' evaluated at the pre-activations of f and y_i.
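The chain-rule computation in equations (5)-(9) can be written out directly. Below is a sketch (my own, with a sigmoid in place of σ and randomly chosen weights) that computes ∂f/∂u_j and ∂f/∂w_{ij} and checks one entry against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, W, u):
    y = sigmoid(W @ x)          # hidden units, equation (8): y_i = sigma(sum_l W[i, l] x_l)
    return sigmoid(u @ y), y    # output, equation (5): f = sigma(sum_j u_j y_j)

def gradients(x, W, u):
    y = sigmoid(W @ x)
    s_out = dsigmoid(u @ y)
    df_du = s_out * y                                   # equation (6)
    df_dW = np.outer(s_out * u * dsigmoid(W @ x), x)    # equation (9)
    return df_du, df_dW

x = rng.normal(size=3)
W = rng.normal(size=(4, 3))
u = rng.normal(size=4)
df_du, df_dW = gradients(x, W, u)

# Finite-difference check of df/dW[0, 0] (eps is an arbitrary small step).
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
numeric = (forward(x, W_pert, u)[0] - forward(x, W, u)[0]) / eps
print(df_dW[0, 0], numeric)
```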

4 Computing weight vector derivatives in 3-layer neural networks

The output is again

(10)    f = \sigma\left( \sum_{j} u_j y_j \right).

Taking the derivative with respect to a weight w_{iq} in the bottom layer results in multiple summations:

(11)    \frac{\partial f}{\partial w_{iq}} = \sigma'\left( \sum_j u_j y_j \right) \sum_l u_l \frac{\partial y_l}{\partial w_{iq}},

i.e., a weighted sum of the derivatives of the y_l's with respect to w_{iq}. Here y_l = \sigma\left( \sum_t v_{lt} z_t \right), which implies that

\frac{\partial y_l}{\partial w_{iq}} = \sigma'(y_l)\, v_{li}\, \frac{\partial z_i}{\partial w_{iq}}.
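The notes end here, but the pattern in equation (11), an outer σ' times a weighted sum of inner derivatives applied recursively, can be sketched in code. This is my own illustration with a sigmoid for σ and an assumed layer structure x → z (weights w) → y (weights v) → f (weights u):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def df_dw(x, w, v, u, i, q):
    # Forward pass through the three layers.
    z = sigmoid(w @ x)          # z_i = sigma(sum_s w_{is} x_s)
    y = sigmoid(v @ z)          # y_l = sigma(sum_t v_{lt} z_t)
    # dz_i / dw_{iq} = sigma'(pre-activation of z_i) * x_q  (only z_i depends on w_{iq})
    dz_i = dsigmoid((w @ x)[i]) * x[q]
    # dy_l / dw_{iq} = sigma'(pre-activation of y_l) * v_{li} * dz_i/dw_{iq}
    dy = dsigmoid(v @ z) * v[:, i] * dz_i
    # Equation (11): df/dw_{iq} = sigma'(sum_j u_j y_j) * sum_l u_l dy_l/dw_{iq}
    return dsigmoid(u @ y) * np.sum(u * dy)

x = rng.normal(size=3)
w = rng.normal(size=(4, 3))   # inputs -> first hidden layer z
v = rng.normal(size=(5, 4))   # z -> second hidden layer y
u = rng.normal(size=5)        # y -> output f
print(df_dw(x, w, v, u, i=1, q=2))
```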