Foundations of Artificial Intelligence


Foundations of Artificial Intelligence 14. Deep Learning: An Overview. Joschka Boedecker, Wolfram Burgard, and Bernhard Nebel. Guest lecturer: Frank Hutter. Albert-Ludwigs-Universität Freiburg, July 14, 2017

Motivation: Deep Learning in the News Foundations of AI July 14, 2017 2

Motivation: Why is Deep Learning so Popular? Excellent empirical results, e.g., in computer vision Foundations of AI July 14, 2017 3

Motivation: Why is Deep Learning so Popular? Excellent empirical results, e.g., in speech recognition Foundations of AI July 14, 2017 4

Motivation: Why is Deep Learning so Popular? Excellent empirical results, e.g., in reasoning in games - Superhuman performance in playing Atari games [Mnih et al., Nature 2015] - Beating the world's best Go player [Silver et al., Nature 2016]. More reasons for the popularity of deep learning appear throughout the lecture. Foundations of AI July 14, 2017 5

Lecture Overview 1 Representation Learning and Deep Learning 2 Multilayer Perceptrons 3 Optimization of Neural Networks in a Nutshell 4 Overview of Some Advanced Topics Convolutional neural networks Recurrent neural networks Deep reinforcement learning 5 Wrapup Foundations of AI July 14, 2017 6

Some definitions. Representation learning: a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep learning: representation learning methods with multiple levels of representation, obtained by composing simple but nonlinear modules that each transform the representation at one level into a [...] higher, slightly more abstract (one) (LeCun et al., 2015). Foundations of AI July 14, 2017 8

Standard Machine Learning Pipeline. Standard machine learning algorithms are based on high-level attributes or features of the data, e.g., the binary attributes we used for decision trees. This requires (often substantial) feature engineering. Foundations of AI July 14, 2017 9

Representation Learning Pipeline. Jointly learn features and classifier, directly from raw data. This is also referred to as end-to-end learning. Foundations of AI July 14, 2017 10

Shallow vs. Deep Learning Foundations of AI July 14, 2017 11

Shallow vs. Deep Learning. [Figure: a hierarchy from pixels to edges, contours, object parts, and finally classes (human, cat, dog).] Deep Learning: learning a hierarchy of representations that build on each other, from simple to complex. Quintessential deep learning model: Multilayer Perceptrons. Foundations of AI July 14, 2017 12

Biological Inspiration of Artificial Neural Networks. Dendrites input information to the cell. The neuron fires (has an action potential) if a certain threshold for the voltage is exceeded. Information is output via the axon. The axon is connected to the dendrites of other cells via synapses. Learning: adaptation of a synapse's efficiency, its synaptic weight. [Figure: a neuron with dendrites, soma, axon, and synapses.] Foundations of AI July 14, 2017 13

A Very Brief History of Neural Networks. Neural networks have a long history:
- 1942: artificial neurons (McCulloch/Pitts)
- 1958/1969: perceptron (Rosenblatt; Minsky/Papert)
- 1986: multilayer perceptrons and backpropagation (Rumelhart)
- 1989: convolutional neural networks (LeCun)
Alternative, theoretically motivated methods outperformed NNs.
- Exaggerated expectations: "It works like the brain" (No, it does not!)
Why the sudden success of neural networks in the last 5 years?
- Data: availability of massive amounts of labelled data
- Compute power: ability to train very large neural networks on GPUs
- Methodological advances: many since the first renewed popularization
Foundations of AI July 14, 2017 14

Lecture Overview 1 Representation Learning and Deep Learning 2 Multilayer Perceptrons 3 Optimization of Neural Networks in a Nutshell 4 Overview of Some Advanced Topics Convolutional neural networks Recurrent neural networks Deep reinforcement learning 5 Wrapup Foundations of AI July 14, 2017 15

Multilayer Perceptrons. [Figure from Bishop, Ch. 5: a network with inputs $x_0, x_1, \dots, x_D$, hidden units $z_0, z_1, \dots, z_M$, outputs $y_1, \dots, y_K$, and weights $w^{(1)}$, $w^{(2)}$.] The network is organized in layers:
- Outputs of the $k$-th layer serve as inputs of the $(k+1)$-th layer.
Each layer $k$ only does quite simple computations:
- Linear function of the previous layer's outputs $z_{k-1}$: $a_k = W_k z_{k-1} + b_k$
- Nonlinear transformation $z_k = h_k(a_k)$ through an activation function $h_k$
Foundations of AI July 14, 2017 16
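
To make the per-layer computation concrete, here is a minimal NumPy sketch of a forward pass through a small MLP. The layer sizes and the choice of a tanh hidden activation with a linear output are illustrative assumptions, not taken from the slide.

```python
import numpy as np

def layer(z_prev, W, b, h):
    """One layer: linear map a = W z_prev + b, then elementwise activation z = h(a)."""
    a = W @ z_prev + b
    return h(a)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # input z_0 = x (3 features, arbitrary)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer with 4 units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer with 2 units

z1 = layer(x, W1, b1, np.tanh)                  # hidden activations z_1 = h_1(a_1)
y = layer(z1, W2, b2, lambda a: a)              # linear output activation (regression-style)
print(y)
```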

Activation Functions - Examples. Logistic sigmoid activation function: $h_{\text{logistic}}(a) = \frac{1}{1 + \exp(-a)}$. Hyperbolic tangent activation function: $h_{\tanh}(a) = \tanh(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)}$. [Figure: plots of both functions.] Foundations of AI July 14, 2017 17

Activation Functions - Examples (cont.). Linear activation function: $h_{\text{linear}}(a) = a$. Rectified Linear (ReLU) activation function: $h_{\text{relu}}(a) = \max(0, a)$. [Figure: plots of both functions.] Foundations of AI July 14, 2017 18
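
A short sketch of these four activation functions, written directly from their definitions; the sample pre-activation values are arbitrary.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)                # equals (exp(a) - exp(-a)) / (exp(a) + exp(-a))

def linear(a):
    return a

def relu(a):
    return np.maximum(0.0, a)

a = np.linspace(-8.0, 8.0, 5)        # a few sample pre-activations
for h in (logistic, tanh, linear, relu):
    print(h.__name__, h(a))
```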

Output unit activation functions. Depending on the task, typically:
- for regression: a single output neuron with linear activation
- for binary classification: a single output neuron with logistic/tanh activation
- for multiclass classification: $K$ output neurons and softmax activation, $(\hat{y}(x, w))_k = h_{\text{softmax}}((a)_k) = \frac{\exp((a)_k)}{\sum_j \exp((a)_j)}$
so for the complete output layer: $\hat{y}(x, w) = (p(y_1 = 1 \mid x), p(y_2 = 1 \mid x), \dots, p(y_K = 1 \mid x))^\top = \frac{1}{\sum_{j=1}^{K} \exp((a)_j)} \exp(a)$
Foundations of AI July 14, 2017 19
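
A minimal sketch of the softmax output activation. Subtracting the maximum before exponentiating is a common numerical-stability trick and is an addition, not something shown on the slide.

```python
import numpy as np

def softmax(a):
    """Softmax over a vector of K output activations a."""
    a = a - np.max(a)                  # numerical-stability trick (not on the slide)
    e = np.exp(a)
    return e / np.sum(e)

a = np.array([2.0, 1.0, 0.1])          # toy output activations
y_hat = softmax(a)
print(y_hat, y_hat.sum())              # entries are p(y_k = 1 | x); they sum to 1
```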

Loss function to be minimized. Consider a binary classification task using a single output unit with logistic sigmoid activation function: $\hat{y}(x, w) = h_{\text{logistic}}(a) = \frac{1}{1 + \exp(-a)}$. This defines a (Bernoulli) probability distribution over the label of each data point $x_n$: $p(y_n = 1 \mid x_n, w) = \hat{y}(x_n, w)$ and $p(y_n = 0 \mid x_n, w) = 1 - \hat{y}(x_n, w)$. Rewritten: $p(y_n \mid x_n, w) = \hat{y}(x_n, w)^{y_n} \{1 - \hat{y}(x_n, w)\}^{1 - y_n}$. Minimize the negative log likelihood of this distribution (aka cross entropy): $L(w) = -\sum_{n=1}^{N} \{y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n)\}$. Foundations of AI July 14, 2017 20

Loss function to be minimized. For multiclass classification, use the generalization of the cross-entropy error: $L(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} y_{kn} \ln \hat{y}_k(x_n, w)$. For regression, e.g., use the squared error function: $L(w) = \frac{1}{2} \sum_{n=1}^{N} \{\hat{y}(x_n, w) - y_n\}^2$. Foundations of AI July 14, 2017 21
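
These loss functions can be written down almost verbatim; the sketch below assumes NumPy arrays of predictions and targets, with one-hot labels in the multiclass case and toy values for the usage example.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """Negative Bernoulli log likelihood: -sum_n [ y_n ln yhat_n + (1 - y_n) ln(1 - yhat_n) ]."""
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_cross_entropy(Y, Y_hat):
    """Generalized cross entropy; Y and Y_hat have shape (N, K), Y is one-hot."""
    return -np.sum(Y * np.log(Y_hat))

def squared_error(y, y_hat):
    """Squared error loss for regression."""
    return 0.5 * np.sum((y_hat - y) ** 2)

y = np.array([1.0, 0.0, 1.0])             # toy labels
y_hat = np.array([0.9, 0.2, 0.7])         # toy predictions
print(binary_cross_entropy(y, y_hat), squared_error(y, y_hat))
```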

Optimizing a loss / error function. Given training data $D = \{(x_i, y_i)\}_{i=1}^{N}$ and the topology of an MLP, the task is to adapt the weights $w$ to minimize the loss: $\min_w L(w; D)$. If we interpret $L$ just as a mathematical function depending on $w$ and forget about its semantics, we are faced with a problem of mathematical optimization. Foundations of AI July 14, 2017 22

Lecture Overview 1 Representation Learning and Deep Learning 2 Multilayer Perceptrons 3 Optimization of Neural Networks in a Nutshell 4 Overview of Some Advanced Topics Convolutional neural networks Recurrent neural networks Deep reinforcement learning 5 Wrapup Foundations of AI July 14, 2017 23

Optimization theory. Discusses mathematical problems of the form: $\min_u f(u)$, where $u$ is any vector of suitable size. Simplification: here, we only consider functions $f$ which are continuous and differentiable. [Figure: a non-continuous (disrupted) function, a continuous but non-differentiable (folded) function, and a differentiable (smooth) function.] Foundations of AI July 14, 2017 24

Optimization theory (cont.). A global minimum $u^*$ is a point such that $f(u^*) \le f(u)$ for all $u$. A local minimum $u^+$ is a point such that there exists $r > 0$ with $f(u^+) \le f(u)$ for all points $u$ with $\|u - u^+\| < r$. [Figure: a function with its global and local minima marked.] Foundations of AI July 14, 2017 25

Optimization theory (cont.). Analytical way to find a minimum: for a local minimum $u^+$, the gradient of $f$ becomes zero, $\frac{\partial f}{\partial u_i}(u^+) = 0$ for all $i$. Hence, calculating all partial derivatives and looking for zeros is a good idea. But for neural networks we cannot write down a solution for the minimization problem in closed form, even though $\frac{\partial f}{\partial u_i} = 0$ holds at (local) solution points, so we need to resort to iterative methods. Foundations of AI July 14, 2017 26

Optimization theory (cont.). Numerical way to find a minimum, by searching: assume we start at a point $u$. Which is the best direction to search for a point $v$ with $f(v) < f(u)$? Which is the best step width? If the slope is negative (descending), go right; if the slope is positive (ascending), go left. If the slope is small, take a small step; if the slope is large, take a large step. General principle: $v_i \leftarrow u_i - \epsilon \frac{\partial f}{\partial u_i}$, where $\epsilon > 0$ is called the learning rate. Foundations of AI July 14, 2017 27

Gradient descent. Gradient descent approach:
Require: mathematical function $f$, learning rate $\epsilon > 0$
Ensure: returned vector is close to a local minimum of $f$
1: choose an initial point $u$
2: while $\nabla_u f(u)$ is not close to 0 do
3: $u \leftarrow u - \epsilon \nabla_u f(u)$
4: end while
5: return $u$
Note: $\nabla_u f := [\frac{\partial f}{\partial u_1}, \dots, \frac{\partial f}{\partial u_K}]$ for $K$-dimensional $u$. Foundations of AI July 14, 2017 28
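
A minimal sketch of this gradient descent loop, applied to a simple quadratic whose gradient is known in closed form; the tolerance and iteration cap are illustrative additions.

```python
import numpy as np

def gradient_descent(grad_f, u0, eps=0.1, tol=1e-6, max_iter=10_000):
    """Follow the negative gradient until it is close to zero."""
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(u)
        if np.linalg.norm(g) < tol:
            break
        u = u - eps * g
    return u

# f(u) = (u_1 - 3)^2 + (u_2 + 1)^2 has its minimum at (3, -1)
grad_f = lambda u: 2 * (u - np.array([3.0, -1.0]))
print(gradient_descent(grad_f, u0=[0.0, 0.0]))
```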

Calculating partial derivatives. Our typical loss functions are defined across data points: $L(w) = \sum_{n=1}^{N} L_n(w) = \sum_{n=1}^{N} L(f(x_n; w), y_n)$. We can compute their partial derivatives as a sum over data points: $\frac{\partial L}{\partial w_j} = \sum_{n=1}^{N} \frac{\partial L_n}{\partial w_j}$. The method of backpropagation makes consistent use of the chain rule of calculus to compute the partial derivatives $\frac{\partial L_n}{\partial w_j}$ w.r.t. each network weight $w_j$, re-using previously computed results. Backpropagation is not covered here, but, e.g., in the ML lecture. Foundations of AI July 14, 2017 29
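
To illustrate the sum-over-data-points structure (not backpropagation itself), the sketch below uses the simplest possible network, a single logistic output unit with cross-entropy loss, for which the per-example gradient has the standard closed form $(\hat{y}_n - y_n) x_n$; the toy data are arbitrary.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def full_gradient(w, X, y):
    """dL/dw as a sum of per-example gradients dL_n/dw.

    For a single logistic unit with cross-entropy loss, dL_n/dw = (yhat_n - y_n) * x_n."""
    grad = np.zeros_like(w)
    for x_n, y_n in zip(X, y):
        y_hat_n = logistic(w @ x_n)
        grad += (y_hat_n - y_n) * x_n
    return grad

X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.3]])   # toy inputs
y = np.array([1.0, 0.0, 1.0])                          # toy labels
print(full_gradient(np.zeros(2), X, y))
```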

Do we need gradients based on the entire data set? Using the entire set is referred to as batch gradient descent. Gradients get more accurate when based on more data points, but using more data has diminishing returns w.r.t. the reduction in error; usually there is faster progress by updating more often based on cheaper, less accurate estimates of the gradient. Common approach in practice: compute gradients over mini-batches, i.e., small subsets of the training data. Today, this is commonly called stochastic gradient descent (SGD). Foundations of AI July 14, 2017 30

Stochastic gradient descent. Stochastic gradient descent (SGD):
Require: mathematical function $f$, learning rate $\epsilon > 0$
Ensure: returned vector is close to a local minimum of $f$
1: choose an initial point $w$
2: while stopping criterion not met do
3: sample a minibatch of $m$ examples $x^{(1)}, \dots, x^{(m)}$ with corresponding targets $y^{(i)}$ from the training set
4: compute gradient $g \leftarrow \frac{1}{m} \nabla_w \sum_{i=1}^{m} L(f(x^{(i)}; w), y^{(i)})$
5: update parameters: $w \leftarrow w - \epsilon g$
6: end while
7: return $w$
Foundations of AI July 14, 2017 31
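
A minimal sketch of the SGD loop above; the epoch-wise reshuffling, batch size, and learning rate are illustrative choices, and the minibatch gradient function is passed in (e.g., the hypothetical full_gradient helper from the earlier sketch).

```python
import numpy as np

def sgd(grad_minibatch, w0, X, y, eps=0.05, m=2, n_epochs=20, seed=0):
    """Minibatch SGD: repeatedly update w using a gradient estimated from m examples."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(n_epochs):
        idx = rng.permutation(len(X))                # visit the data in a fresh random order
        for start in range(0, len(X), m):
            batch = idx[start:start + m]
            g = grad_minibatch(w, X[batch], y[batch]) / len(batch)   # average gradient on the minibatch
            w = w - eps * g
    return w

# Usage (hypothetical), with the full_gradient helper from the previous sketch:
# w = sgd(full_gradient, np.zeros(2), X, y)
```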

Problems with suboptimal choices for the learning rate $\epsilon$:
- Case 1, small $\epsilon$: convergence
- Case 2, very small $\epsilon$: convergence, but it may take very long
- Case 3, medium-size $\epsilon$: convergence
- Case 4, large $\epsilon$: divergence
Foundations of AI July 14, 2017 32-35

Other reasons for problems with gradient descent. Flat spots and steep valleys: we need a larger $\epsilon$ in $u$ to jump over the uninteresting flat area, but a smaller $\epsilon$ in $v$ to meet the minimum. Zig-zagging in higher dimensions: $\epsilon$ is not appropriate for all dimensions. Foundations of AI July 14, 2017 36

Learning rate quiz. Which curve denotes a low, high, very high, and good learning rate? [Figure: four loss-vs-epoch curves labeled a) to d).] Foundations of AI July 14, 2017 37

Gradient descent - Conclusion. Pure gradient descent is a nice framework; in practice, stochastic gradient descent is used. Finding the right learning rate $\epsilon$ is tedious. Heuristics to overcome problems of gradient descent:
- Gradient descent with momentum
- Individual learning rates for each dimension
- Adaptive learning rates
- Decoupling step length from partial derivatives
Foundations of AI July 14, 2017 38
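
As one example of these heuristics, a sketch of a classical momentum update; the momentum coefficient and learning rate are illustrative values, and the gradient is a stand-in for one computed from a minibatch.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, eps=0.01, mu=0.9):
    """One SGD-with-momentum update: the velocity v accumulates past gradients."""
    v = mu * v - eps * grad          # classical momentum (mu and eps are illustrative)
    w = w + v
    return w, v

w, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, -0.3])         # pretend this came from a minibatch
w, v = sgd_momentum_step(w, v, grad)
print(w, v)
```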

Lecture Overview 1 Representation Learning and Deep Learning 2 Multilayer Perceptrons 3 Optimization of Neural Networks in a Nutshell 4 Overview of Some Advanced Topics Convolutional neural networks Recurrent neural networks Deep reinforcement learning 5 Wrapup Foundations of AI July 14, 2017 39

Historical context and inspiration from Neuroscience. Hubel & Wiesel (Nobel prize 1981) found in several studies in the 1950s and 1960s: the visual cortex has feature detectors (e.g., cells with a preference for edges of a specific orientation), and the edge location did not matter. Simple cells act as local feature detectors, complex cells pool the responses of simple cells, and there is a feature hierarchy. Foundations of AI July 14, 2017 41

Learned feature hierarchy [From recent Yann LeCun slides] [slide credit: Andrej Karpathy] Foundations of AI July 14, 2017 42

Convolutions illustrated. Convolution layer: a 32x32x3 image and a 5x5x3 filter; filters always extend the full depth of the input volume. Convolve the filter with the image, i.e., slide it over the image spatially, computing dot products. [slide credit: Andrej Karpathy] Foundations of AI July 14, 2017 43

Convolutions illustrated (cont.). Convolution layer: 32x32x3 image, 5x5x3 filter. Each output is 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e., a 5*5*3 = 75-dimensional dot product + bias). [slide credit: Andrej Karpathy] Foundations of AI July 14, 2017 44

Convolutions illustrated (cont.). Convolving (sliding) the 5x5x3 filter over all spatial locations of the 32x32x3 image yields a 28x28 activation map. [slide credit: Andrej Karpathy] Foundations of AI July 14, 2017 45

Convolutions - several filters. Consider a second (green) 5x5x3 filter: convolving it over all spatial locations yields a second 28x28 activation map. [slide credit: Andrej Karpathy] Foundations of AI July 14, 2017 46

Convolutions - several filters. For example, if we had six 5x5 filters, we would get 6 separate 28x28 activation maps. We stack these up to get a new image of size 28x28x6! [slide credit: Andrej Karpathy] Foundations of AI July 14, 2017 47
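
A naive (and slow) sketch of this convolution layer: six 5x5x3 filters slid over a 32x32x3 image, each position contributing a 75-dimensional dot product plus bias, giving a 28x28x6 output. Real implementations use optimized routines; this is only meant to show the computation, and the random image and filters are placeholders.

```python
import numpy as np

def conv_layer(image, filters, biases):
    """Naive valid convolution: slide each filter over the image spatially.

    image: (32, 32, 3), filters: (6, 5, 5, 3), biases: (6,) -> output (28, 28, 6)."""
    H, W, _ = image.shape
    n_f, fh, fw, _ = filters.shape
    out = np.zeros((H - fh + 1, W - fw + 1, n_f))
    for k in range(n_f):
        for i in range(H - fh + 1):
            for j in range(W - fw + 1):
                chunk = image[i:i + fh, j:j + fw, :]                   # 5x5x3 chunk
                out[i, j, k] = np.sum(chunk * filters[k]) + biases[k]  # 75-dim dot product + bias
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))
activation_maps = conv_layer(img, rng.normal(size=(6, 5, 5, 3)), np.zeros(6))
print(activation_maps.shape)   # (28, 28, 6)
```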

Stacking several convolutional layers. Convolutional layers stacked in a ConvNet: a 32x32x3 input, CONV + ReLU with 6 5x5x3 filters giving 28x28x6, CONV + ReLU with 10 5x5x6 filters giving 24x24x10, and so on. [slide credit: Andrej Karpathy] Foundations of AI July 14, 2017 48

Learned feature hierarchy [From recent Yann LeCun slides] [slide credit: Andrej Karpathy] Foundations of AI July 14, 2017 49

Lecture Overview 1 Representation Learning and Deep Learning 2 Multilayer Perceptrons 3 Optimization of Neural Networks in a Nutshell 4 Overview of Some Advanced Topics Convolutional neural networks Recurrent neural networks Deep reinforcement learning 5 Wrapup Foundations of AI July 14, 2017 50

Feedforward vs. Recurrent Neural Networks. [Figure: a feedforward network (left) and a recurrent network (right). Source: Jaeger, 2001] Foundations of AI July 14, 2017 51

Recurrent Neural Networks (RNNs). Neural networks that allow for cycles in the connectivity graph. Cycles let information persist in the network for some time (state), and provide a time-context or (fading) memory. Very powerful for processing sequences. They implement dynamical systems rather than function mappings, and can approximate any dynamical system with arbitrary precision. They are Turing-complete [Siegelmann and Sontag, 1991]. Foundations of AI July 14, 2017 52
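
A minimal sketch of a vanilla RNN update, showing how the hidden state carries information across time steps; the dimensions, the tanh nonlinearity, and the random weights are illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the hidden state h persists and is updated at each step."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)                                  # initial state
for x_t in rng.normal(size=(5, 3)):              # a toy sequence of 5 three-dimensional inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)        # state carries context from earlier inputs
print(h)
```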

Abstract schematic, with a fully connected hidden layer. [Figure.] Foundations of AI July 14, 2017 53

Sequence-to-sequence mapping. One-to-many mapping (e.g., image caption generation) and many-to-one mapping (e.g., temporal classification). Foundations of AI July 14, 2017 54

Sequence-to-sequence mapping (cont.). Many-to-many mapping (e.g., video frame labeling) and many-to-many mapping (e.g., automatic translation). Foundations of AI July 14, 2017 55

Lecture Overview 1 Representation Learning and Deep Learning 2 Multilayer Perceptrons 3 Optimization of Neural Networks in a Nutshell 4 Overview of Some Advanced Topics Convolutional neural networks Recurrent neural networks Deep reinforcement learning 5 Wrapup Foundations of AI July 14, 2017 56

Reinforcement Learning. Finding optimal policies for MDPs. Reminder: states $s \in S$, actions $a \in A$, transition model $T$, rewards $r$. Policy: a complete mapping $\pi : S \to A$ that specifies for each state $s$ which action $\pi(s)$ to take. Foundations of AI July 14, 2017 57

Deep Reinforcement Learning.
Policy-based deep RL:
- Represent the policy $\pi : S \to A$ as a deep neural network with weights $w$
- Evaluate $w$ by rolling out the policy defined by $w$
- Optimize the weights to obtain higher rewards (using approximate gradients)
- Examples: AlphaGo & modern Atari agents
Value-based deep RL:
- Basically value iteration, but using a deep neural network (= function approximator) to generalize across many states and actions
- Approximate the optimal state-value function $U(s)$ or state-action value function $Q(s, a)$
Model-based deep RL:
- If the transition model $T$ is not known, approximate $T$ with a deep neural network (learned from data) and plan using this approximate transition model
Use deep neural networks to represent the policy / value function / model. Foundations of AI July 14, 2017 58
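
As a rough illustration of value-based deep RL, a sketch of a tiny network that maps a state to Q-values for a handful of discrete actions, plus epsilon-greedy action selection. The network size, state/action dimensions, epsilon, and random weights are illustrative assumptions, and no training is shown.

```python
import numpy as np

def q_values(state, W1, b1, W2, b2):
    """Tiny value network: state -> Q(s, a) for each discrete action a."""
    hidden = np.maximum(0.0, W1 @ state + b1)    # ReLU hidden layer
    return W2 @ hidden + b2                      # one output per action

def epsilon_greedy(q, epsilon=0.1, rng=np.random.default_rng(0)):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # 4-dimensional state, 8 hidden units
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)    # 3 discrete actions
q = q_values(rng.normal(size=4), W1, b1, W2, b2)
print(q, epsilon_greedy(q))
```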

Lecture Overview 1 Representation Learning and Deep Learning 2 Multilayer Perceptrons 3 Optimization of Neural Networks in a Nutshell 4 Overview of Some Advanced Topics Convolutional neural networks Recurrent neural networks Deep reinforcement learning 5 Wrapup Foundations of AI July 14, 2017 59

An Exciting Approach to AI: Learning as an Alternative to Traditional Programming. We don't understand how the human brain solves certain problems:
- Face recognition
- Speech recognition
- Playing Atari games
- Picking the next move in the game of Go
We can nevertheless learn these tasks from data/experience. If the task changes, we simply re-train. We can construct computer systems that are too complex for us to fully understand ourselves anymore...
- E.g., deep neural networks have millions of weights.
- E.g., AlphaGo, the system that beat world champion Lee Sedol: David Silver, lead author of AlphaGo, cannot say why a move is good; paraphrased: "You would have to ask a Go expert."
Foundations of AI July 14, 2017 60

Summary: Why is Deep Learning so Popular?
- Excellent empirical results in many domains: very scalable to big data, but beware: not a silver bullet
- Analogy to the ways humans process information: mostly tangential
- Allows end-to-end learning: no more need for many complicated subsystems; e.g., it dramatically simplified Google's translation system
- Very versatile/flexible: easy to combine building blocks; allows supervised, unsupervised, and reinforcement learning
Foundations of AI July 14, 2017 61

Lots of Work on Deep Learning in Freiburg.
- Computer Vision (Thomas Brox): images, video
- Robotics (Wolfram Burgard): navigation, grasping, object recognition
- Neurorobotics (Joschka Boedecker): robotic control
- Machine Learning (Frank Hutter): optimization of deep nets, learning to learn
- Neuroscience (Tonio Ball, Michael Tangermann, and others): EEG data and other applications from BrainLinks-BrainTools
Details when the individual groups present their research. Foundations of AI July 14, 2017 62

Summary by learning goals. Having heard this lecture, you can now:
- Explain the terms representation learning and deep learning
- Describe the main principles behind MLPs
- Describe how neural networks are optimized in practice
- On a high level, describe Convolutional Neural Networks, Recurrent Neural Networks, and Deep Reinforcement Learning
Foundations of AI July 14, 2017 63