Neural Networks. Nicholas Ruozzi, University of Texas at Dallas


Handwritten Digit Recognition. Given a collection of handwritten digits and their corresponding labels, we'd like to be able to correctly classify handwritten digits. A simple algorithmic technique can solve this problem with 95% accuracy. This seems surprising; in fact, state-of-the-art methods can achieve near 99% accuracy (you've probably seen these in action if you've deposited a check recently). [Figure: digits from the MNIST data set]

Neural Networks. The basis of neural networks was developed in the 1940s-1960s. The idea was to build mathematical models that might compute in the same way that neurons in the brain do. As a result, neural networks are biologically inspired, though many of the algorithms developed for them are not biologically plausible. They perform surprisingly well on the handwritten digit recognition task.

Neural Networks. Neural networks consist of a collection of artificial neurons. There are different types of neuron models that are commonly studied: the perceptron (one of the first studied), the sigmoid neuron (one of the most common, though there are many more), and rectified linear units. A neural network is typically a directed graph consisting of a collection of neurons (the nodes in the graph), directed edges (each with an associated weight), and a collection of fixed binary inputs.

The Perceptron. A perceptron is an artificial neuron that takes a collection of binary inputs and produces a binary output. The output of the perceptron is determined by summing up the weighted inputs and thresholding the result: if the weighted sum is larger than the threshold, the output is one (and zero otherwise). [Figure: a perceptron with inputs $x_1, x_2, x_3$ and output $y$]

$$y = \begin{cases} 1 & w_1 x_1 + w_2 x_2 + w_3 x_3 > \text{threshold} \\ 0 & \text{otherwise} \end{cases}$$

The Perceptron. The weights can be both positive and negative, and many simple decisions can be modeled using perceptrons; the sketch below illustrates the thresholding rule.
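
A minimal sketch of this thresholding rule in Python (using NumPy; the weights and threshold below are made up for illustration, not taken from the slides):

```python
import numpy as np

def perceptron_output(x, w, threshold):
    """Fire (output 1) exactly when the weighted sum of the inputs exceeds the threshold."""
    return int(np.dot(w, x) > threshold)

# Illustrative example: one positive weight, one negative weight.
w, threshold = np.array([2.0, -1.0]), 0.5
print(perceptron_output(np.array([1, 0]), w, threshold))  # 2.0 > 0.5  -> 1
print(perceptron_output(np.array([0, 1]), w, threshold))  # -1.0 > 0.5 -> 0
```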

Perceptron for NOT. [Figure: a single-input perceptron with input $x$ and output $y$] Choose $w = -1$, threshold $= -0.5$:

$$y = \begin{cases} 1 & -x > -0.5 \\ 0 & -x \le -0.5 \end{cases}$$

so $y = 1$ exactly when $x = 0$.

Perceptron for OR

Perceptron for OR. [Figure: a two-input perceptron computing $x_1 \lor x_2$] Choose $w_1 = w_2 = 1$, threshold $= 0$:

$$y = \begin{cases} 1 & x_1 + x_2 > 0 \\ 0 & x_1 + x_2 \le 0 \end{cases}$$

Perceptron for AND

Perceptron for AND. [Figure: a two-input perceptron computing $x_1 \land x_2$] Choose $w_1 = w_2 = 1$, threshold $= 1.5$:

$$y = \begin{cases} 1 & x_1 + x_2 > 1.5 \\ 0 & x_1 + x_2 \le 1.5 \end{cases}$$
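
A quick sanity check of the OR and AND constructions, as a sketch (the NOT weights follow the negative-weight construction reconstructed above):

```python
import numpy as np

def perceptron(x, w, threshold):
    """Output 1 iff the weighted input sum exceeds the threshold."""
    return int(np.dot(w, x) > threshold)

NOT = lambda a: perceptron(np.array([a]), np.array([-1.0]), -0.5)
OR  = lambda a, b: perceptron(np.array([a, b]), np.array([1.0, 1.0]), 0.0)
AND = lambda a, b: perceptron(np.array([a, b]), np.array([1.0, 1.0]), 1.5)

print([NOT(a) for a in (0, 1)])                                       # [1, 0]
print([(a, b, OR(a, b), AND(a, b)) for a in (0, 1) for b in (0, 1)])  # OR/AND truth tables
```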

Perceptron for XOR

Perceptron for XOR. We need more than one perceptron! [Figure: a two-layer network of perceptrons computing XOR from OR and AND units] Weights for the incoming edges are chosen as before. Networks of perceptrons can encode any circuit!
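
One way to wire this up, as a sketch: the decomposition $(x_1 \lor x_2) \land \lnot(x_1 \land x_2)$ is one standard choice, not necessarily the exact wiring in the slide's figure; the NOT is absorbed into a negative incoming weight.

```python
import numpy as np

def perceptron(x, w, threshold):
    return int(np.dot(w, x) > threshold)

OR  = lambda a, b: perceptron(np.array([a, b]), np.array([1.0, 1.0]), 0.0)
AND = lambda a, b: perceptron(np.array([a, b]), np.array([1.0, 1.0]), 1.5)

def XOR(a, b):
    # Second layer: one positive and one negative incoming weight implements
    # (a OR b) AND NOT(a AND b).
    return perceptron(np.array([OR(a, b), AND(a, b)]), np.array([1.0, -1.0]), 0.5)

print([XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```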

Perceptrons. Perceptrons are usually expressed in terms of a collection of input weights and a bias $b$ (which is the negative of the threshold). [Figure: a perceptron with inputs $x_1, x_2, x_3$ and output $y$]

$$y = \begin{cases} 1 & w_1 x_1 + w_2 x_2 + w_3 x_3 + b > 0 \\ 0 & \text{otherwise} \end{cases}$$

A single-node perceptron is just a linear classifier; this is actually where the perceptron algorithm comes from.

Neural Networks. Gluing a bunch of perceptrons together gives us a neural network. In general, neural nets have a collection of binary inputs and a collection of binary outputs. [Figure: a network of perceptrons mapping a layer of inputs to a layer of outputs]

Beyond Perceptrons. Given a collection of input-output pairs, we'd like to learn the weights of the neural network so that we can correctly predict the output of an unseen input. We could try learning via gradient descent (e.g., by minimizing the Hamming loss), but this approach doesn't work so well: small changes in the weights can cause dramatic changes in the output. This is a consequence of the discontinuity of sharp thresholding (the same problem we saw with SVMs).

The Sigmoid Neuron. A sigmoid neuron is an artificial neuron that takes a collection of inputs in the interval $[0,1]$ and produces an output in the interval $[0,1]$. The output is determined by summing up the weighted inputs plus the bias and applying the sigmoid function to the result. [Figure: a sigmoid neuron with inputs $x_1, x_2, x_3$ and output $y$]

$$y = \sigma(w_1 x_1 + w_2 x_2 + w_3 x_3 + b),$$

where $\sigma$ is the sigmoid function.

The Sigmoid Function. The sigmoid function is a continuous function that approximates a step function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
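
A small sketch of a sigmoid neuron (the weights and bias below are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), a smooth approximation of the step function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    """Weighted sum plus bias, squashed through the sigmoid."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative example: three inputs in [0, 1].
x = np.array([0.2, 0.9, 0.5])
w = np.array([1.0, -2.0, 0.5])
print(sigmoid_neuron(x, w, b=0.1))  # a value strictly between 0 and 1
```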

Rectified Linear Units. The sigmoid neuron approximates a step function with a smooth function; analogously, the smoothed ReLU approximates the hinge function $\max(0, x)$ with the smooth, continuous function $\ln(1 + e^x)$.
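
A sketch comparing the two activations (the name "softplus" for $\ln(1 + e^x)$ is standard usage, not from the slides):

```python
import numpy as np

def relu(x):
    """The rectifier max(0, x)."""
    return np.maximum(0.0, x)

def softplus(x):
    """Smooth approximation ln(1 + e^x) of the rectifier."""
    return np.log1p(np.exp(x))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu(z))      # [0. 0. 0. 1. 3.]
print(softplus(z))  # close to relu(z) away from 0, smooth near 0
```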

Multilayer Neural Networks. [Figure: a multilayer network, from Neural Networks and Deep Learning by Michael Nielsen]

Multilayer Neural Networks. Note: NO intralayer connections. [Figure from Neural Networks and Deep Learning by Michael Nielsen]

Neural Network for Digit Classification. [Figure from Neural Networks and Deep Learning by Michael Nielsen]

Neural Network for Digit Classification. Why 10 output neurons instead of 4? [Figure from Neural Networks and Deep Learning by Michael Nielsen]

Expressiveness of NNs. Boolean functions: every Boolean function can be represented by a network with a single hidden layer, though it may require exponentially many hidden units. Continuous functions: every bounded continuous function can be approximated up to arbitrarily small error by a network with one hidden layer, and any function can be approximated to arbitrary accuracy with two hidden layers.

Training Neural Networks. To do the learning, we first need to define a loss function to minimize:

$$C(w, b) = \frac{1}{2M}\sum_m \left\| y^{m} - a(x^{m}, w, b) \right\|^2$$

The training data consists of input-output pairs $(x^1, y^1), \ldots, (x^M, y^M)$; $a(x^m, w, b)$ is the output of the neural network for the $m$th sample; $w$ and $b$ are the weights and biases.
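
A sketch of this quadratic loss in code. The `forward` function standing in for $a(x, w, b)$ is hypothetical; the toy one-layer "network" at the end is made up purely to exercise the function:

```python
import numpy as np

def quadratic_loss(forward, data, w, b):
    """C(w, b) = 1/(2M) * sum_m || y^m - a(x^m, w, b) ||^2 over the training pairs."""
    M = len(data)
    return sum(np.sum((y - forward(x, w, b)) ** 2) for x, y in data) / (2.0 * M)

# Toy check with a made-up one-layer "network" a(x, w, b) = sigmoid(w @ x + b).
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
forward = lambda x, w, b: sigmoid(w @ x + b)
data = [(np.array([0.0, 1.0]), np.array([1.0])),
        (np.array([1.0, 0.0]), np.array([0.0]))]
print(quadratic_loss(forward, data, w=np.array([[-1.0, 1.0]]), b=np.array([0.5])))
```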

Gradient of the Loss. The derivative of the loss function is calculated as follows:

$$\frac{\partial C(w, b)}{\partial w_k} = -\frac{1}{M} \sum_m \left( y^{m} - a(x^{m}, w, b) \right) \frac{\partial a(x^{m}, w, b)}{\partial w_k}$$

To compute the derivative of $a$, use the chain rule and the derivative of the sigmoid function,

$$\frac{d\sigma(z)}{dz} = \sigma(z)\left(1 - \sigma(z)\right).$$

This gets complicated quickly with lots of layers of neurons.

Stochastic Gradient Descent. To make the training more practical, stochastic gradient descent is used instead of standard gradient descent. Recall, the idea of stochastic gradient descent is to approximate the gradient of a sum by sampling a few indices and averaging:

$$\nabla_x \sum_{i=1}^{n} f_i(x) \approx \frac{1}{K} \sum_{k=1}^{K} \nabla_x f_{i_k}(x)$$

where, for example, each $i_k$ is sampled uniformly at random from $\{1, \ldots, n\}$.
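
A minimal sketch of the idea, assuming a hypothetical list `grad_fns` of per-sample gradient functions $\nabla_x f_i$; the toy objective at the end is made up for illustration:

```python
import numpy as np

def sgd_step(x, grad_fns, step_size, K, rng):
    """One stochastic gradient step: average K randomly sampled per-sample gradients."""
    idx = rng.integers(0, len(grad_fns), size=K)   # i_1, ..., i_K uniform over the samples
    g = sum(grad_fns[i](x) for i in idx) / K       # (1/K) * sum_k grad f_{i_k}(x)
    return x - step_size * g

# Toy usage: f_i(x) = 0.5 * (x - c_i)^2, so grad f_i(x) = x - c_i.
rng = np.random.default_rng(0)
centers = rng.normal(size=100)
grads = [lambda x, c=c: x - c for c in centers]
x = 0.0
for _ in range(200):
    x = sgd_step(x, grads, step_size=0.1, K=10, rng=rng)
print(x, centers.mean())  # x should end up close to the mean of the centers
```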

Computing the Gradient. We'll compute the gradient for a single sample. Some definitions: $L$ is the number of layers; $C(w, b) = \frac{1}{2}\left\| y - a(x, w, b) \right\|^2$; $a_j^l$ is the output of the $j$th neuron in the $l$th layer; $z_j^l$ is the input of the $j$th neuron in the $l$th layer,

$$z_j^l = \sum_k w_{jk}^l\, a_k^{l-1} + b_j^l;$$

and $\delta_j^l$ is defined to be $\frac{\partial C}{\partial z_j^l}$.

Computing the Gradient. For the output layer, we have the following partial derivative:

$$\delta_j^L = \frac{\partial C}{\partial z_j^L} = -\left(y_j - a_j^L\right)\frac{\partial a_j^L}{\partial z_j^L} = -\left(y_j - a_j^L\right)\frac{\partial \sigma(z_j^L)}{\partial z_j^L} = \left(a_j^L - y_j\right)\sigma(z_j^L)\left(1 - \sigma(z_j^L)\right)$$

For simplicity, we will denote the vector of all such partials for each node in the $l$th layer as $\delta^l$.

Computing the Gradient. For the $L-1$ layer, we have the following partial derivative:

$$\begin{aligned}
\frac{\partial C}{\partial z_k^{L-1}}
&= \sum_j \left(a_j^L - y_j\right)\frac{\partial a_j^L}{\partial z_k^{L-1}}
 = \sum_j \left(a_j^L - y_j\right)\frac{\partial \sigma(z_j^L)}{\partial z_k^{L-1}} \\
&= \sum_j \left(a_j^L - y_j\right)\sigma(z_j^L)\left(1 - \sigma(z_j^L)\right)\frac{\partial z_j^L}{\partial z_k^{L-1}} \\
&= \sum_j \left(a_j^L - y_j\right)\sigma(z_j^L)\left(1 - \sigma(z_j^L)\right)\frac{\partial}{\partial z_k^{L-1}}\left(\sum_{k'} w_{jk'}^{L}\, a_{k'}^{L-1} + b_j^{L}\right) \\
&= \sum_j \left(a_j^L - y_j\right)\sigma(z_j^L)\left(1 - \sigma(z_j^L)\right) w_{jk}^{L}\,\sigma(z_k^{L-1})\left(1 - \sigma(z_k^{L-1})\right) \\
&= (\delta^L)^T w_{\cdot k}^{L}\,\sigma(z_k^{L-1})\left(1 - \sigma(z_k^{L-1})\right)
\end{aligned}$$

where $w_{\cdot k}^{L}$ denotes the $k$th column of the weight matrix $w^L$.

Computing the Gradient. We can think of $w^l$ as a matrix. This allows us to write

$$\delta^{L-1} = (\delta^L)^T w^{L} \odot \sigma(z^{L-1}) \odot \left(1 - \sigma(z^{L-1})\right)$$

where $\sigma(z^{L-1})$ is the vector whose $k$th component is $\sigma(z_k^{L-1})$ and $\odot$ denotes the componentwise product. Applying the same strategy, for $l < L$,

$$\delta^{l} = (\delta^{l+1})^T w^{l+1} \odot \sigma(z^{l}) \odot \left(1 - \sigma(z^{l})\right)$$

Computing the Gradient. Now, for the partial derivatives that we care about:

$$\frac{\partial C}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l}\frac{\partial z_j^l}{\partial b_j^l} = \delta_j^l
\qquad\qquad
\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l}\frac{\partial z_j^l}{\partial w_{jk}^l} = \delta_j^l\, a_k^{l-1}$$

We can compute these derivatives one layer at a time!

Backpropagation: Putting it all together.

1. Compute the inputs/outputs for each layer by starting at the input layer and applying the sigmoid functions (the forward pass).
2. Compute $\delta^L$ for the output layer: $\delta_j^L = \left(a_j^L - y_j\right)\sigma(z_j^L)\left(1 - \sigma(z_j^L)\right)$.
3. Starting from $l = L-1$ and working backwards, compute $\delta^{l} = (\delta^{l+1})^T w^{l+1} \odot \sigma(z^{l}) \odot \left(1 - \sigma(z^{l})\right)$.
4. Perform gradient descent: $b_j^l \leftarrow b_j^l - \gamma\, \delta_j^l$ and $w_{jk}^l \leftarrow w_{jk}^l - \gamma\, \delta_j^l\, a_k^{l-1}$.
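
A compact sketch of these four steps for a fully connected sigmoid network with the quadratic loss. This is a minimal illustration under the conventions above, not the course's reference implementation; the layer sizes and data at the bottom are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, weights, biases, gamma):
    """One gradient-descent step on a single sample using backpropagation.
    weights[l] has shape (n_{l+1}, n_l); biases[l] has shape (n_{l+1},)."""
    # Step 1: forward pass, recording the inputs z and outputs a of every layer.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Step 2: output layer, delta^L = (a^L - y) * sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    deltas = [delta]
    # Step 3: backward pass, delta^l = (w^{l+1})^T delta^{l+1} * sigma'(z^l).
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid(zs[l]) * (1 - sigmoid(zs[l]))
        deltas.insert(0, delta)
    # Step 4: gradient descent, dC/dw^l = delta^l (a^{l-1})^T and dC/db^l = delta^l.
    for l in range(len(weights)):
        weights[l] -= gamma * np.outer(deltas[l], activations[l])
        biases[l] -= gamma * deltas[l]

# Toy usage with made-up sizes: 3 inputs, one hidden layer of 4, 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(4, 3)), rng.normal(scale=0.5, size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
x, y = rng.random(3), np.array([0.0, 1.0])
for _ in range(1000):
    backprop_step(x, y, weights, biases, gamma=0.5)

a = x
for W, b in zip(weights, biases):
    a = sigmoid(W @ a + b)
print(a)  # should be close to the target y = [0, 1]
```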

Backpropagation. Backpropagation converges to a local minimum (the loss is not convex in the weights and biases); like EM, we can simply run it several times with different initializations. Training can take a very long time (even with stochastic gradient descent), but prediction after learning is fast. Sometimes a momentum term $\alpha$ is included in the gradient update:

$$w^{(t)} = w^{(t-1)} - \gamma\, \nabla_w C^{(t-1)} + \alpha\left(-\gamma\, \nabla_w C^{(t-2)}\right)$$

i.e., a fraction $\alpha$ of the previous (negative-gradient) step is added to the current update.
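
A sketch of momentum in its usual accumulated form, where each step adds a fraction $\alpha$ of the previous step (unrolling this once and truncating gives the two-gradient formula above); `grad_C` here is a hypothetical gradient oracle and the toy objective is made up:

```python
import numpy as np

def momentum_step(w, prev_step, grad_C, gamma, alpha):
    """Take a step of -gamma * grad C(w) plus a fraction alpha of the previous step."""
    step = -gamma * grad_C(w) + alpha * prev_step
    return w + step, step

# Toy usage on C(w) = 0.5 * ||w||^2, whose gradient is simply w.
grad_C = lambda w: w
w, prev = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    w, prev = momentum_step(w, prev, grad_C, gamma=0.1, alpha=0.9)
print(w)  # approaches the minimizer at the origin
```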

Overfitting. [Two figure-only slides illustrating overfitting]

Neural Networks in Practice. There are many ways to improve weight learning in NNs: use a regularizer (better generalization), try other loss functions, initialize the weights of the network more cleverly, etc. Random initializations are likely to be far from optimal, and the learning procedure can have numerical difficulties if there are a large number of layers.

Regularized Loss. Penalize learning large weights:

$$C(w, b) = \frac{1}{2M}\sum_m \left\| y^{m} - a(x^{m}, w, b) \right\|^2 + \frac{\lambda}{2}\|w\|_2^2$$

We can still use the backpropagation algorithm in this setting, and $\ell_1$ regularization can also be useful. Regularization can significantly help with overfitting, but $\lambda$ will often need to be quite large, as the size of the training set is typically much larger than what we have been working with. How should we choose $\lambda$?
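
In backpropagation, the only change the $\ell_2$ penalty makes to the weight gradient is an extra $\lambda w$ term. A sketch of the modified update, reusing the per-sample `deltas` and `activations` from the earlier backpropagation sketch (the names are illustrative):

```python
import numpy as np

def regularized_update(weights, biases, deltas, activations, gamma, lam):
    """Gradient step with the L2 penalty: the gradient of (lambda/2)*||w||^2 is lambda*w."""
    for l in range(len(weights)):
        grad_W = np.outer(deltas[l], activations[l]) + lam * weights[l]
        weights[l] -= gamma * grad_W
        biases[l] -= gamma * deltas[l]  # the biases are typically left unpenalized
```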

Dropout. A heuristic, bagging-style approach applied to neural networks to counteract overfitting: randomly remove a certain percentage of the neurons from the network and train only on the remaining neurons. The resulting networks are recombined using an approximate averaging technique (keeping around many networks and doing proper bagging can be costly in practice).
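
A minimal sketch of dropout applied to one layer's activations during training. The rescaling shown here is the common "inverted dropout" variant, and the drop probability is illustrative:

```python
import numpy as np

def dropout(a, p_drop, rng, training=True):
    """Randomly zero out a fraction p_drop of the activations during training.
    Surviving activations are rescaled so the expected value is unchanged
    (inverted dropout); at test time the layer is left untouched."""
    if not training:
        return a
    mask = rng.random(a.shape) >= p_drop
    return a * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones(10)
print(dropout(a, p_drop=0.5, rng=rng))  # roughly half the entries zeroed, the rest doubled
```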

Other Techniques. Early stopping: stop the learning early in the hope that this prevents overfitting. Parameter tying: assume some of the weights in the model are the same in order to reduce the dimensionality of the learning problem; this is also a way to learn simpler models.

Other Ideas. Convolutional neural networks: instead of the output of every neuron at layer $l$ being used as an input to every neuron at layer $l + 1$, the edges between layers are chosen more locally. They use many tied weights and biases (i.e., convolutional nets apply the same process to many different local chunks of neurons) and are often combined with pooling layers (i.e., layers that, say, halve the number of neurons by replacing small regions of neurons with their maximum output). They are used extensively in neural nets for image classification tasks.
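
A sketch of the two ingredients on a 1-D signal: a shared-weight (convolutional) filter slid over local chunks, followed by max pooling that halves the number of outputs. The filter and sizes below are illustrative:

```python
import numpy as np

def conv1d(x, kernel, bias=0.0):
    """Apply the same weights (kernel) to every local chunk of x (valid convolution)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) + bias for i in range(len(x) - k + 1)])

def max_pool(x, size=2):
    """Replace each non-overlapping region of `size` entries with its maximum."""
    trimmed = x[: (len(x) // size) * size]
    return trimmed.reshape(-1, size).max(axis=1)

x = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 4.0, 2.0, 0.0])
h = conv1d(x, kernel=np.array([-1.0, 2.0, -1.0]))  # a simple "peak detector" filter
print(h)
print(max_pool(h, size=2))  # half as many outputs
```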