Backpropagation. Introduction to Machine Learning. Matt Gormley, Lecture 12, Feb 23, 2018

10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 12 Feb 23, 2018 1

Neural Networks Outline
Logistic Regression (Recap): Data, Model, Learning, Prediction
Neural Networks: A Recipe for Machine Learning; Visual Notation for Neural Networks; Example: Logistic Regression Output Surface; 2-Layer Neural Network; 3-Layer Neural Network
Neural Net Architectures: Objective Functions; Activation Functions
Backpropagation: Basic Chain Rule (of calculus); Chain Rule for Arbitrary Computation Graph; Backpropagation Algorithm; Module-based Automatic Differentiation (Autodiff)
(Last Lecture / This Lecture) 2

ARCHITECTURES 3

Neural Network Architectures Even for a basic Neural Network, there are many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function 4
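To make these design decisions concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture; the name make_mlp, the initialization scale, and the single sigmoid output unit are all assumptions) in which depth, width, and the activation function are explicit arguments:

```python
import numpy as np

def make_mlp(n_inputs, hidden_widths, activation=np.tanh, seed=0):
    """Initialize a fully connected net: depth = len(hidden_widths),
    width = the entries of hidden_widths, nonlinearity = `activation`."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs] + list(hidden_widths) + [1]          # single output unit
    weights = [rng.normal(0.0, 0.1, (fan_out, fan_in + 1))  # +1 column for bias
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]
    return weights, activation

def forward(x, weights, activation):
    """Run one example through the net; sigmoid output for a probability."""
    h = x
    for W in weights[:-1]:
        h = activation(W @ np.append(h, 1.0))   # hidden layers
    b = weights[-1] @ np.append(h, 1.0)         # output (linear)
    return 1.0 / (1.0 + np.exp(-b))             # output (sigmoid)

weights, act = make_mlp(n_inputs=3, hidden_widths=[4, 4], activation=np.tanh)
print(forward(np.array([1.0, -2.0, 0.5]), weights, act))
```

Changing hidden_widths or activation here corresponds to design decisions 1-3 above; the output nonlinearity and loss (decision 4) are discussed on the objective-function slides later in the lecture.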

Building a Neural Net Q: How many hidden units, D, should we use? Output Features 5

Building a Neural Net Q: How many hidden units, D, should we use? Output Hidden Layer D = M Input 6

Building a Neural Net Q: How many hidden units, D, should we use? Output Hidden Layer D = M Input 7

Building a Neural Net Q: How many hidden units, D, should we use? Output Hidden Layer D = M Input 8

Building a Neural Net Q: How many hidden units, D, should we use? What method(s) is this setting similar to? Output Hidden Layer D < M Input 9

Building a Neural Net Q: How many hidden units, D, should we use? What method(s) is this setting similar to? Output Hidden Layer D > M Input 10

Decision Functions Deeper Networks Q: How many layers should we use? Output Hidden Layer 1 Input 11

Decision Functions Deeper Networks Q: How many layers should we use? Output Hidden Layer 2 Hidden Layer 1 Input 12

Decision Functions Deeper Networks Q: How many layers should we use? Output Hidden Layer 3 Hidden Layer 2 Hidden Layer 1 Input 13

Decision Functions Deeper Networks Q: How many layers should we use?
Theoretical answer: A neural network with 1 hidden layer is a universal function approximator. Cybenko (1989): for any continuous function g(x), there exists a 1-hidden-layer neural net h_θ(x) s.t. |h_θ(x) − g(x)| < ε for all x, assuming sigmoid activation functions.
Empirical answer: Before 2006, deep networks (e.g. 3 or more hidden layers) were too hard to train. After 2006, deep networks are easier to train than shallow networks (e.g. 2 or fewer layers) for many problems. Big caveat: You need to know and use the right tricks. 14

Decision Functions Different Levels of Abstraction We don't know the right levels of abstraction, so let the model figure it out! Example from Honglak Lee (NIPS 2010) 18

Decision Functions Different Levels of Abstraction Face Recognition: a deep network can build up increasingly higher levels of abstraction: lines, parts, regions. Example from Honglak Lee (NIPS 2010) 19

Decision Functions Different Levels of Abstraction Output Hidden Layer 3 Hidden Layer 2 Hidden Layer 1 Input Example from Honglak Lee (NIPS 2010) 20

Activation Functions Neural Network with sigmoid activation functions
(F) Loss: J = (1/2)(y − y*)²
(E) Output (sigmoid): y = 1/(1 + exp(−b))
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1/(1 + exp(−a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: given x_i, ∀i 21
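These layered equations translate line for line into code. Below is a minimal NumPy sketch (mine, not the course's starter code; the function names and the convention x_0 = z_0 = 1 for the bias terms are assumptions) that computes quantities (A) through (F) for a single training example:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward_sigmoid_net(x, alpha, beta, y_star):
    """Compute quantities (A)-(F) above for one example.
    alpha: (D, M+1) hidden weights, beta: (D+1,) output weights;
    index 0 of the input / hidden vectors is a bias feature fixed to 1."""
    x = np.concatenate(([1.0], x))           # (A) input, with x_0 = 1
    a = alpha @ x                            # (B) hidden (linear): a_j
    z = np.concatenate(([1.0], sigmoid(a)))  # (C) hidden (sigmoid): z_j, z_0 = 1
    b = beta @ z                             # (D) output (linear): b
    y = sigmoid(b)                           # (E) output (sigmoid): y
    J = 0.5 * (y - y_star) ** 2              # (F) quadratic loss
    return y, J

M, D = 3, 2
rng = np.random.default_rng(0)
alpha = rng.normal(size=(D, M + 1))
beta = rng.normal(size=(D + 1,))
print(forward_sigmoid_net(np.array([1.0, 0.5, -2.0]), alpha, beta, y_star=1.0))
```

The same structure reappears when backpropagation is derived later: each forward quantity a, z, b, y gets a corresponding partial derivative in the backward pass.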

Activation Functions Neural Network with arbitrary nonlinear activation functions σ
(F) Loss: J = (1/2)(y − y*)²
(E) Output (nonlinear): y = σ(b)
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: given x_i, ∀i 22

Activation Functions Sigmoid / Logistic Function: logistic(u) = 1/(1 + e^(−u)). So far, we've assumed that the activation function (nonlinearity) is always the sigmoid function. 23

Activation Functions A new change: modifying the nonlinearity. The logistic is not widely used in modern ANNs. Alternate 1: tanh, like the logistic function but shifted to range [−1, +1]. Slide from William Cohen

[Figure: sigmoid vs. tanh activations in networks of depth 4; from Glorot & Bengio (AISTATS 2010)]

Activation Functions A new change: modifying the nonlinearity. ReLU is often used in vision tasks. Alternate 2: rectified linear unit (ReLU), linear with a cutoff at zero. (Implementation: clip the gradient when you pass zero.) Slide from William Cohen

Activation Functions A new change: modifying the nonlinearity. ReLU is often used in vision tasks. Alternate 2: rectified linear unit (ReLU). Soft version: log(exp(x)+1). Doesn't saturate (at one end). Sparsifies outputs. Helps with vanishing gradient. Slide from William Cohen
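For reference, the nonlinearities mentioned on these slides are one-liners in NumPy; this is a sketch of the standard definitions, not code from the slides:

```python
import numpy as np

def logistic(u):   # sigmoid: outputs in (0, 1)
    return 1.0 / (1.0 + np.exp(-u))

def tanh(u):       # like the logistic, but with range (-1, +1)
    return np.tanh(u)

def relu(u):       # rectified linear unit: linear with a cutoff at zero
    return np.maximum(0.0, u)

def softplus(u):   # "soft" ReLU from the slide: log(exp(u) + 1)
    return np.log1p(np.exp(u))

u = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (logistic, tanh, relu, softplus):
    print(f.__name__, f(u))
```

Note that softplus as written can overflow for large positive u; a numerically safer variant is np.logaddexp(0.0, u), which computes the same quantity.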

Objective Functions for NNs
1. Quadratic Loss: the same objective as Linear Regression, i.e. mean squared error
2. Cross-Entropy: the same objective as Logistic Regression, i.e. negative log likelihood. This requires probabilities, so we add an additional softmax layer at the end of our network.
Forward / Backward:
Quadratic: J = (1/2)(y − y*)²; dJ/dy = y − y*
Cross-Entropy: J = −[y* log(y) + (1 − y*) log(1 − y)]; dJ/dy = −y*/y + (1 − y*)/(1 − y) 28
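The forward/backward table above can be sanity-checked with a few lines of code; this is a small sketch of my own (binary case, with the cross-entropy written as a negative log likelihood as in the slide text):

```python
import numpy as np

def quadratic(y, y_star):
    """Forward and backward pass of the quadratic (squared error) loss."""
    J = 0.5 * (y - y_star) ** 2
    dJ_dy = y - y_star
    return J, dJ_dy

def cross_entropy(y, y_star):
    """Forward and backward pass of binary cross-entropy (negative log likelihood)."""
    J = -(y_star * np.log(y) + (1.0 - y_star) * np.log(1.0 - y))
    dJ_dy = -y_star / y + (1.0 - y_star) / (1.0 - y)
    return J, dJ_dy

print(quadratic(0.8, 1.0))       # J = 0.02,  dJ/dy = -0.2
print(cross_entropy(0.8, 1.0))   # J ~ 0.223, dJ/dy = -1.25
```

In both cases the sign of dJ/dy tells SGD which direction to push the prediction y.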

Objective Functions for NNs [Figure: cross-entropy vs. quadratic loss; from Glorot & Bengio (2010)]

Multi-Class Output Output Hidden Layer Input 30

Multi-Class Output
Softmax: y_k = exp(b_k) / Σ_{l=1}^{K} exp(b_l)
(F) Loss: J = −Σ_{k=1}^{K} y*_k log(y_k)
(E) Output (softmax): y_k = exp(b_k) / Σ_{l=1}^{K} exp(b_l)
(D) Output (linear): b_k = Σ_{j=0}^{D} β_{kj} z_j, ∀k
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: given x_i, ∀i 31
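Here is a minimal sketch of the softmax layer and the multi-class cross-entropy loss defined above (my own code, not the slides', with the standard max-subtraction trick added for numerical stability, which the slide does not discuss):

```python
import numpy as np

def softmax(b):
    """y_k = exp(b_k) / sum_l exp(b_l); subtracting max(b) avoids overflow."""
    e = np.exp(b - np.max(b))
    return e / e.sum()

def cross_entropy(y, y_star):
    """J = -sum_k y*_k log(y_k) for a one-hot (or soft) target y*."""
    return -np.sum(y_star * np.log(y))

b = np.array([2.0, 0.5, -1.0])       # linear outputs b_k
y = softmax(b)
y_star = np.array([1.0, 0.0, 0.0])   # one-hot true label
print(y, cross_entropy(y, y_star))
```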

BACKPROPAGATION 32

Background A Recipe for Machine Learning
1. Given training data
2. Choose each of these: decision function, loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient) 33

Training Approaches to Differentiation Question 1: When can we compute the gradients of the parameters of an arbitrary neural network? Question 2: When can we make the gradient computation efficient? 34

Training Approaches to Differentiation
1. Finite Difference Method
Pro: Great for testing implementations of backpropagation
Con: Slow for high dimensional inputs / outputs
Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
Note: The method you learned in high school; used by Mathematica / Wolfram Alpha / Maple
Pro: Yields easily interpretable derivatives
Con: Leads to exponential computation time if not carefully implemented
Required: Mathematical expression that defines f(x)
3. Automatic Differentiation - Reverse Mode
Note: Called Backpropagation when applied to Neural Nets
Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
Con: Slow for high dimensional outputs (e.g. vector-valued functions)
Required: Algorithm for computing f(x)
4. Automatic Differentiation - Forward Mode
Note: Easy to implement; uses dual numbers (see the sketch below)
Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
Con: Slow for high dimensional inputs (e.g. vector-valued x)
Required: Algorithm for computing f(x) 35
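To give a flavor of forward-mode automatic differentiation (item 4), here is a tiny dual-number sketch; it is an illustration of the idea only, not code from the course, and the example function f is invented:

```python
class Dual:
    """A value paired with its derivative w.r.t. one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)  # product rule

def f(x, z):
    return x * x * z + z   # example function f(x, z) = x^2 z + z

# Seed deriv = 1 on the input we are differentiating with respect to.
print(f(Dual(2.0, 1.0), Dual(3.0, 0.0)).deriv)   # df/dx = 2xz     = 12.0
print(f(Dual(2.0, 0.0), Dual(3.0, 1.0)).deriv)   # df/dz = x^2 + 1 = 5.0
```

Each operation propagates both a value and a derivative, so one forward pass yields the derivative with respect to a single chosen input; reverse mode (backpropagation) instead yields derivatives with respect to all inputs in one backward pass.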

Training Finite Difference Method Notes: Suffers from issues of floating point precision in practice. Typically only appropriate to use on small examples with an appropriately chosen epsilon. 36
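Concretely, the centered-difference approximation dJ/dθ_i ≈ (J(θ + ε e_i) − J(θ − ε e_i)) / (2ε) can be used to check a backpropagation implementation. A minimal sketch (mine, not the course's starter code; the test function J is invented):

```python
import numpy as np

def finite_difference_grad(J, theta, epsilon=1e-5):
    """Approximate dJ/dtheta_i with a centered difference, one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = epsilon
        grad[i] = (J(theta + e) - J(theta - e)) / (2.0 * epsilon)
    return grad

# Check against a function whose exact gradient is known:
# J(theta) = ||theta||^2 / 2 has gradient theta itself.
J = lambda theta: 0.5 * np.sum(theta ** 2)
theta = np.array([1.0, -2.0, 0.5])
print(finite_difference_grad(J, theta))   # approximately [ 1.  -2.   0.5]
```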

Training Symbolic Differentiation Chain Rule Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below? 37

Training Symbolic Differentiation Calculus Quiz #2: 39

Training Chain Rule Whiteboard Chain Rule of Calculus 40

Training Chain Rule
Given: y = g(u) and u = h(x)
Chain Rule: dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k), ∀i, k 41

Training Chain Rule
Given: y = g(u) and u = h(x)
Chain Rule: dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k), ∀i, k
Backpropagation is just repeated application of the chain rule from Calculus 101. 42
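Applying the chain rule layer by layer to the 1-hidden-layer sigmoid network from earlier gives the backward pass below. This is my own minimal sketch (quadratic loss, weight matrices α and β as in the earlier slides, biases handled via x_0 = z_0 = 1), not the lecture's reference pseudocode:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward_backward(x, y_star, alpha, beta):
    """Backpropagation = repeated chain rule on the 1-hidden-layer sigmoid net."""
    # Forward pass (same quantities (A)-(F) as on the earlier slide).
    x = np.concatenate(([1.0], x))     # x_0 = 1 (bias feature)
    a = alpha @ x                      # hidden (linear)
    zh = sigmoid(a)                    # hidden (sigmoid)
    z = np.concatenate(([1.0], zh))    # z_0 = 1 (bias unit)
    b = beta @ z                       # output (linear)
    y = sigmoid(b)                     # output (sigmoid)
    J = 0.5 * (y - y_star) ** 2        # quadratic loss

    # Backward pass: chain rule, one layer at a time.
    dJ_dy = y - y_star
    dJ_db = dJ_dy * y * (1.0 - y)      # sigmoid derivative at the output
    dJ_dbeta = dJ_db * z               # gradient for the output weights
    dJ_dzh = dJ_db * beta[1:]          # skip the bias unit z_0
    dJ_da = dJ_dzh * zh * (1.0 - zh)   # sigmoid derivative at the hidden layer
    dJ_dalpha = np.outer(dJ_da, x)     # gradient for the hidden weights
    return J, dJ_dalpha, dJ_dbeta

rng = np.random.default_rng(0)
alpha, beta = rng.normal(size=(2, 4)), rng.normal(size=(3,))
print(forward_backward(np.array([1.0, 0.5, -2.0]), 1.0, alpha, beta))
```

The gradients returned here are exactly what the finite-difference check sketched earlier should reproduce, up to numerical error.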