Machine Learning

Machine Learning 10-315, Maria Florina Balcan, Machine Learning Department, Carnegie Mellon University, 03/29/2019. Today: artificial neural networks; backpropagation. Reading: Mitchell, Chapter 4; Bishop, Chapter 5.

Artificial Neural Network (ANN). Biological systems are built of very complex webs of interconnected neurons. Each neuron is highly connected to other neurons and performs computations by combining signals from them; the outputs of these computations may be transmitted to one or more other neurons. Artificial neural networks are built out of a densely interconnected set of simple units (e.g., sigmoid units). Each unit takes real-valued inputs (possibly the outputs of other units) and produces a real-valued output (which may become input to many other units).

Example task: input: two features from spectral analysis of a spoken sound; output: the vowel sound occurring in the context "h_d" (e.g., "hid", "head", "had").

ALVINN [Pomerleau 1993]

Artificial neural networks to learn f: X → Y. f_w is typically a non-linear function, f_w: X → Y. X is the feature space: a (vector of) continuous and/or discrete variables; Y is the output space: a (vector of) continuous and/or discrete variables. f_w is a network of basic units. Learning algorithm: given {⟨x_d, t_d⟩}_{d∈D}, train the weights w of all units to minimize the sum of squared errors of the predicted network outputs, i.e., find parameters w to minimize Σ_{d∈D} (f_w(x_d) − t_d)². Use gradient descent!

What type of units should we use? The classifier is a multilayer network of units: each unit takes some inputs and produces one output, and the output of one unit can be the input of another. [Figure: a network with an input layer, a hidden layer, and an output layer.]

Multilayer network of linear units? Advantage: we know how to do gradient descent on linear units. [Figure: inputs x_0 = 1, x_1, x_2 feed hidden units v_1 = w_1 · x and v_2 = w_2 · x, which feed the output z = w_3 · v.] Problem: linear of linear is just linear: z = w_{3,1}(w_1 · x) + w_{3,2}(w_2 · x) = (w_{3,1} w_1 + w_{3,2} w_2) · x, which is still linear in x.
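To make the "linear of linear is just linear" point concrete, here is a small numeric check; the random weights and the 2-hidden-unit shape are assumptions for illustration, not taken from the slide.

```python
# Sketch: composing two linear layers collapses into one linear map.
import numpy as np

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(2, 3))   # hidden layer: 2 linear units over 3 inputs
w_out = rng.normal(size=2)           # output unit weights (w_3 in the slide)
x = rng.normal(size=3)

v = W_hidden @ x                     # hidden activations v_1, v_2
z_two_layer = w_out @ v              # two-layer network output

w_collapsed = w_out @ W_hidden       # the equivalent single linear unit
z_one_layer = w_collapsed @ x

assert np.isclose(z_two_layer, z_one_layer)   # identical function: no extra power
```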

Multilayer network of perceptron units? Advantage: can produce highly non-linear decision boundaries! [Figure: the same network with threshold units, v_1 = step(w_1 · x), v_2 = step(w_2 · x), z = step(w_3 · v).] Threshold function: step(x) = 1 if x is positive, 0 if x is negative. Problem: the discontinuous threshold is not differentiable, so we can't do gradient descent.

Multilayer network of sigmoid units. Advantage: can produce highly non-linear decision boundaries, and the sigmoid is differentiable, so we can use gradient descent. [Figure: the same network with sigmoid units, v_1 = σ(w_1 · x), v_2 = σ(w_2 · x), z = σ(w_3 · v).] σ(x) = 1 / (1 + e^{−x}). Very useful in practice!

The sigmoid unit. σ is the sigmoid function, σ(x) = 1 / (1 + e^{−x}). Nice property: dσ(x)/dx = σ(x)(1 − σ(x)). We can derive gradient descent rules to train: one sigmoid unit; multilayer networks of sigmoid units (backpropagation).
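As a quick check of the derivative identity, here is a minimal sketch (the function names are mine) comparing dσ(x)/dx = σ(x)(1 − σ(x)) with a finite-difference estimate.

```python
# Sketch: the sigmoid and its derivative identity.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)             # dσ/dx = σ(x)(1 − σ(x))

x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(sigmoid_grad(x), numeric)      # the two estimates agree closely
```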

Gradient descent to minimize squared error. Goal: given {⟨x_d, t_d⟩}_{d∈D}, find w to minimize E_D(w) = (1/2) Σ_{d∈D} (f_w(x_d) − t_d)², where the gradient is ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n].
Batch-mode gradient descent: do until satisfied: 1. compute the gradient ∇E_D[w]; 2. update w ← w − η ∇E_D[w].
Incremental (stochastic) gradient descent: do until satisfied: for each training example d in D: 1. compute the gradient ∇E_d[w]; 2. update w ← w − η ∇E_d[w], where E_d(w) = (1/2)(t_d − o_d)².
Note: incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
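The two modes differ only in how often the weights are updated. Here is a schematic sketch using a linear f_w(x) = w · x so the gradient is easy to write; the data and helper names are assumptions for illustration.

```python
# Sketch: batch vs. incremental (stochastic) gradient descent on squared error.
import numpy as np

def grad_single(w, x_d, t_d):
    # ∇_w of (1/2)(w·x_d − t_d)^2  =  (w·x_d − t_d) x_d
    return (w @ x_d - t_d) * x_d

def batch_gradient_descent(w, X, t, eta=0.01, n_iters=1000):
    for _ in range(n_iters):
        grad = sum(grad_single(w, x_d, t_d) for x_d, t_d in zip(X, t))
        w = w - eta * grad                           # one step per pass over all of D
    return w

def incremental_gradient_descent(w, X, t, eta=0.01, n_iters=1000):
    for _ in range(n_iters):
        for x_d, t_d in zip(X, t):
            w = w - eta * grad_single(w, x_d, t_d)   # one step per training example
    return w

X = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 2.0])]
t = [1.0, 2.0, 3.0]                                  # generated by w* = [1, 1]
w0 = np.zeros(2)
print(batch_gradient_descent(w0, X, t), incremental_gradient_descent(w0, X, t))
# both approach the exact solution w ≈ [1, 1]
```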

Gradient descent in weight space. Goal: given {⟨x_d, t_d⟩}_{d∈D}, find w to minimize E_D(w) = (1/2) Σ_{d∈D} (f_w(x_d) − t_d)². This error measure defines a surface over the hypothesis (i.e., weight) space. [Figure: error surface over weights w_1, w_2; figure from Cho & Chow, Neurocomputing 1999.]

Gradient descent in weight space. Gradient descent is an iterative process aimed at finding a minimum in the error surface. On each iteration: the current weights define a point in this space; find the direction in which the error surface descends most steeply; take a step (i.e., update the weights) in that direction. [Figure: a step down the error surface over w_1, w_2.]

Gradient descent in weight space. Calculate the gradient of E: ∇E(w) = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]. Take a step in the opposite direction: Δw = −η ∇E(w), i.e., Δw_i = −η ∂E/∂w_i. [Figure: the components −∂E/∂w_1 and −∂E/∂w_2 of a step on the error surface.]

Taking derivatives: the chain rule. Recall the chain rule from calculus: if y = f(u) and u = g(x), then ∂y/∂x = (∂y/∂u)(∂u/∂x).

Gradient descent for the sigmoid unit. Given {⟨x_d, t_d⟩}_{d∈D}, find w to minimize Σ_{d∈D} (o_d − t_d)², where o_d is the observed unit output for x_d: o_d = σ(net_d) with net_d = Σ_i w_i x_{i,d}.
∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_{d∈D} (t_d − o_d)² ]
= (1/2) Σ_{d∈D} ∂/∂w_i (t_d − o_d)²
= (1/2) Σ_{d∈D} 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
= −Σ_{d∈D} (t_d − o_d) ∂o_d/∂w_i
= −Σ_{d∈D} (t_d − o_d) (∂o_d/∂net_d)(∂net_d/∂w_i).
But we know: ∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 − o_d) and ∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}.
So: ∂E/∂w_i = −Σ_{d∈D} (t_d − o_d) o_d (1 − o_d) x_{i,d}.

Gradient descent for the sigmoid unit. Given {⟨x_d, t_d⟩}_{d∈D}, find w to minimize Σ_{d∈D} (o_d − t_d)², where o_d = σ(net_d) is the observed unit output for x_d and net_d = Σ_i w_i x_{i,d}. From the derivation above, ∂E/∂w_i = −Σ_{d∈D} (t_d − o_d) o_d (1 − o_d) x_{i,d} = −Σ_{d∈D} δ_d x_{i,d}, where δ_d is the error term (t_d − o_d) multiplied by the factor o_d (1 − o_d) that comes from the derivative of the sigmoid function. Update rule: w ← w − η ∇E[w].
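Putting the gradient formula and the update rule together, here is a minimal sketch of training a single sigmoid unit by batch gradient descent; the OR data, learning rate, and iteration count are assumptions for illustration.

```python
# Sketch: gradient descent for one sigmoid unit, using
# ∂E/∂w_i = −Σ_d (t_d − o_d) o_d (1 − o_d) x_{i,d}  and  w ← w − η ∇E[w].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_unit(X, t, eta=0.5, n_iters=5000):
    # X: one row per example, including a constant 1 column for the bias w_0.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        o = sigmoid(X @ w)               # o_d for every example
        delta = (t - o) * o * (1 - o)    # δ_d
        grad_E = -X.T @ delta            # ∂E/∂w_i = −Σ_d δ_d x_{i,d}
        w -= eta * grad_E
    return w

# Example: learn OR of two inputs (the first column is the bias input x_0 = 1).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0, 1, 1, 1], dtype=float)
w = train_sigmoid_unit(X, t)
print(sigmoid(X @ w))                    # outputs move toward the targets [0, 1, 1, 1]
```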

Gradient descent for multilayer networks. Given {⟨x_d, t_d⟩}_{d∈D}, find w to minimize E_D(w) = (1/2) Σ_{d∈D} Σ_{k∈Outputs} (o_{k,d} − t_{k,d})².

Backpropagation algorithm (incremental/stochastic gradient descent). Notation: o = observed unit output, t = target output, x_{i,j} = the i-th input to unit j, w_{i,j} = the weight from i to j.
Initialize all weights to small random numbers. Until satisfied, do: for each training example (x, t):
1. Input the training example to the network and compute the network outputs.
2. For each output unit k: δ_k ← o_k (1 − o_k)(t_k − o_k).
3. For each hidden unit h: δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_{h,k} δ_k.
4. Update each network weight: w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}.
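A compact sketch of this loop for a single hidden layer of sigmoid units, trained example by example; the XOR data, layer sizes, learning rate, and epoch count are assumptions for illustration, not part of the slide.

```python
# Sketch: stochastic backpropagation for a 2-3-1 network of sigmoid units.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 3))   # hidden weights (3 units, 2 inputs + bias)
W2 = rng.normal(scale=0.5, size=(1, 4))   # output weights (1 unit, 3 hidden + bias)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets
eta = 0.5

for _ in range(20000):                                   # "until satisfied"
    for x, t in zip(X, T):
        # 1. Forward pass: compute hidden and output activations.
        h_in = np.concatenate(([1.0], x))                # bias input x_0 = 1
        h = sigmoid(W1 @ h_in)
        o_in = np.concatenate(([1.0], h))
        o = sigmoid(W2 @ o_in)
        # 2. Output units: δ_k = o_k (1 − o_k)(t_k − o_k).
        delta_o = o * (1 - o) * (t - o)
        # 3. Hidden units: δ_h = o_h (1 − o_h) Σ_k w_{h,k} δ_k.
        delta_h = h * (1 - h) * (W2[:, 1:].T @ delta_o)
        # 4. Weight updates: Δw_{i,j} = η δ_j x_{i,j}.
        W2 += eta * np.outer(delta_o, o_in)
        W1 += eta * np.outer(delta_h, h_in)

for x in X:
    h = sigmoid(W1 @ np.concatenate(([1.0], x)))
    o = sigmoid(W2 @ np.concatenate(([1.0], h)))
    print(x, o)   # outputs should approach 0, 1, 1, 0; this can depend on the init
```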

Validation/generalization error first decreases, then increases, as the weights become tuned to fit idiosyncrasies of the training examples that are not representative of the general distribution. Stop at the point of lowest error over the validation set. It is not always obvious when the lowest error over the validation set has been reached.

Dealing with overfitting. Our learning algorithm involves a parameter n = the number of gradient descent iterations. How do we choose n to optimize future error? Separate the available data into a training set and a validation set, use the training set to perform gradient descent, and choose n as the number of iterations that optimizes validation set error, as in the sketch below.
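A minimal sketch of this procedure on synthetic data; the data, the single sigmoid unit, the learning rate, and the iteration budget are all assumptions for illustration.

```python
# Sketch: choose n (number of gradient-descent iterations) on a validation set.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)); X[:, 0] = 1.0               # bias column x_0 = 1
t = (X @ np.array([0.2, 1.5, -2.0])
     + rng.normal(scale=0.5, size=200) > 0).astype(float)

X_train, t_train = X[:100], t[:100]                        # training set
X_val, t_val = X[100:], t[100:]                            # validation set

w = np.zeros(3)
eta, max_iters = 0.05, 500
best_err, best_w, best_n = np.inf, w.copy(), 0

for n in range(1, max_iters + 1):
    o = sigmoid(X_train @ w)                               # one batch gradient step
    w += eta * X_train.T @ ((t_train - o) * o * (1 - o))   # on the training set only
    val_err = 0.5 * np.sum((t_val - sigmoid(X_val @ w)) ** 2)
    if val_err < best_err:                                 # remember the iteration
        best_err, best_w, best_n = val_err, w.copy(), n    # with lowest val. error

print(best_n, best_err)   # n and weights chosen by validation error, not training error
```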

Dealing with overfitting: regularization techniques include norm constraints, dropout, batch normalization, data augmentation, and early stopping.

Expressive capabilities of ANNs. Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but this might require a number of hidden units exponential in the number of inputs. Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]; any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

Representing simple Boolean functions with a single threshold unit over inputs x_i ∈ {0, 1} (the figures use a constant bias input x_0 = 1, so a threshold θ corresponds to bias weight w_0 = −θ).
OR function x_{i1} ∨ x_{i2} ∨ … ∨ x_{ik}: set w_i = 1 if i is one of the i_j, w_i = 0 otherwise, with threshold 0.5.
AND function x_{i1} ∧ x_{i2} ∧ … ∧ x_{ik}: same weights, with threshold k − 0.5.
AND with negations, e.g. x_{i1} ∧ ¬x_{i2} ∧ … ∧ x_{ik}: set w_i = 1 if x_i is not negated, w_i = −1 if x_i is negated, w_i = 0 otherwise, with threshold t − 0.5, where t = the number of non-negated inputs.
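A small sketch of these constructions with an explicit threshold unit; the helper names and the particular choice of literals are mine.

```python
# Sketch: OR, AND, and AND-with-negation as single threshold units over 3 inputs.
import numpy as np

def threshold_unit(bias, w, x):
    # Output 1 iff bias + w·x > 0 (bias = −threshold on the constant input x_0 = 1).
    return int(bias + np.dot(w, x) > 0)

w_or = np.array([1, 1, 0]); b_or = -0.5                  # x_1 OR x_2: threshold 0.5
w_and = np.array([1, 1, 0]); b_and = -(2 - 0.5)          # x_1 AND x_2: threshold k − 0.5, k = 2
w_andnot = np.array([1, -1, 0]); b_andnot = -(1 - 0.5)   # x_1 AND (NOT x_2)

for x in [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]:
    print(x, threshold_unit(b_or, w_or, x),
          threshold_unit(b_and, w_and, x),
          threshold_unit(b_andnot, w_andnot, x))
```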

General Boolean functions. Every Boolean function can be represented by a network with a single hidden layer, though it might require an exponential number of hidden units. Any Boolean function can be written as a truth table; for example, over x_1 x_2 x_3, with positive rows marked "+": 000 +, 001, 010, 011 +, 100, 101, 110 +, 111 +. View the function as an OR of ANDs, with one AND for each positive entry: (¬x_1 ∧ ¬x_2 ∧ ¬x_3) ∨ (¬x_1 ∧ x_2 ∧ x_3) ∨ (x_1 ∧ x_2 ∧ ¬x_3) ∨ (x_1 ∧ x_2 ∧ x_3). Then combine the AND and OR networks into a 2-layer network.
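A short sketch that builds this OR-of-ANDs network for the truth table above; the function names are mine.

```python
# Sketch: a 2-layer threshold network, one AND (minterm) unit per positive row.
import numpy as np

positive_rows = [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 1, 1)]   # rows marked "+"

def and_unit(row, x):
    # Fires iff x equals row: w_i = +1/−1 and threshold (# of ones in row) − 0.5.
    w = np.array([1 if b else -1 for b in row])
    return int(np.dot(w, x) > sum(row) - 0.5)

def two_layer_net(x):
    hidden = [and_unit(row, x) for row in positive_rows]       # hidden AND layer
    return int(sum(hidden) > 0.5)                              # output OR unit

for x in [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]:
    print(x, two_layer_net(x))   # outputs 1 exactly on the four positive rows
```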

Artificial neural networks: summary. Highly non-linear regression/classification; vector-valued inputs and outputs; potentially millions of parameters to estimate; actively used to model distributed computation in the brain; hidden layers learn intermediate representations; stochastic gradient descent and local-minima problems; overfitting and how to deal with it.

Other activation functions. Problem with the sigmoid: saturation, where the gradient becomes too small. [Figure: the sigmoid curve flattens for large |x|; figure borrowed from Pattern Recognition and Machine Learning, Bishop.]

Other activation functions: ReLU (rectified linear unit), ReLU(z) = max{z, 0}. [Figure from Deep Learning, by Goodfellow, Bengio, Courville.]

Other activation functions: ReLU (rectified linear unit), ReLU(z) = max{z, 0}. Its gradient is 0 for z < 0 and 1 for z > 0.

Other activation functions: generalizations of ReLU, gReLU(z) = max{z, 0} + α min{z, 0}. Leaky-ReLU(z) = max{z, 0} + 0.01 min{z, 0}. Parametric-ReLU(z): α is learnable. [Figure: the gReLU curve, with slope 1 for z > 0 and slope α for z < 0.]
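A small sketch of this family and its (sub)gradient; taking α = 0.01 gives Leaky-ReLU, α = 0 gives plain ReLU, and Parametric-ReLU would instead treat α as a learnable parameter.

```python
# Sketch: generalized ReLU, gReLU(z) = max{z, 0} + α min{z, 0}, and its gradient.
import numpy as np

def grelu(z, alpha):
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

def grelu_grad(z, alpha):
    # Slope 1 for z > 0 and α for z < 0 (any value in between works at z = 0).
    return np.where(z > 0, 1.0, alpha)

relu = lambda z: grelu(z, 0.0)            # ReLU(z)       = max{z, 0}
leaky_relu = lambda z: grelu(z, 0.01)     # Leaky-ReLU(z) = max{z, 0} + 0.01 min{z, 0}

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), leaky_relu(z), grelu_grad(z, 0.01))
```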