Lecture 12: Neural Networks (Bastian Leibe, RWTH Aachen)


Advanced Machine Learning, Lecture 12: Neural Networks. 10.12.2015. Bastian Leibe, RWTH Aachen. http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de

This Lecture: Advanced Machine Learning. Regression approaches: Linear Regression, Regularization (Ridge, Lasso), Gaussian Processes. Learning with latent variables: Prob. Distributions & Approx. Inference, Mixture Models, EM and Generalizations. Deep Learning: Linear Discriminants, Neural Networks, Backpropagation, CNNs, RNNs, RBMs, etc.

Recap: Generalized Linear Discriminants. Extension with non-linear basis functions: transform the vector $\mathbf{x}$ with $M$ non-linear basis functions $\phi_j(\mathbf{x})$. The basis functions $\phi_j(\mathbf{x})$ allow non-linear decision boundaries, and the activation function $g(\cdot)$ bounds the influence of outliers. Disadvantage: the minimization is no longer possible in closed form. Notation: $\phi_0(\mathbf{x}) = 1$. Slide adapted from Bernt Schiele

Recap: Gradient Descent. Iterative minimization: start with an initial guess $w_{kj}^{(0)}$ for the parameter values, then move towards a (local) minimum by following the gradient. Basic strategies: batch learning, $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left.\frac{\partial E(\mathbf{w})}{\partial w_{kj}}\right|_{\mathbf{w}^{(\tau)}}$, and sequential updating, $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left.\frac{\partial E_n(\mathbf{w})}{\partial w_{kj}}\right|_{\mathbf{w}^{(\tau)}}$, where $E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})$.

Recap: Gradient Descent. Example: quadratic error function $E(\mathbf{w}) = \sum_{n=1}^{N} (y(\mathbf{x}_n; \mathbf{w}) - t_n)^2$. Sequential updating leads to the delta rule (= LMS rule): $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,(y_k(\mathbf{x}_n; \mathbf{w}) - t_{kn})\,\phi_j(\mathbf{x}_n) = w_{kj}^{(\tau)} - \eta\,\delta_{kn}\,\phi_j(\mathbf{x}_n)$, where $\delta_{kn} = y_k(\mathbf{x}_n; \mathbf{w}) - t_{kn}$. Simply feed back the input data point, weighted by the classification error. Slide adapted from Bernt Schiele
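
To make the update concrete, here is a minimal NumPy sketch of one sequential pass with the delta rule. The function name, the learning rate `eta`, and the array shapes are choices made for this illustration, not notation from the slides.

```python
import numpy as np

def delta_rule_epoch(W, Phi, T, eta=0.01):
    """One sequential pass of the delta (LMS) rule over the training set.

    W   : (K, M) weight matrix, row k holds the weights of output unit k
    Phi : (N, M) design matrix, row n holds the basis-function values phi_j(x_n)
    T   : (N, K) targets t_{kn}
    eta : learning rate
    """
    for phi_n, t_n in zip(Phi, T):
        y_n = W @ phi_n                       # linear outputs y_k(x_n; w)
        delta = y_n - t_n                     # delta_kn = y_k(x_n; w) - t_kn
        W = W - eta * np.outer(delta, phi_n)  # w_kj <- w_kj - eta * delta_kn * phi_j(x_n)
    return W
```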

Recap: Probabilistic Discriminative Models. Consider models of the form $p(C_1|\phi) = y(\phi) = \sigma(\mathbf{w}^T \phi)$ with $p(C_2|\phi) = 1 - p(C_1|\phi)$. This model is called logistic regression. Properties: it has a probabilistic interpretation, but it is a discriminative method that only focuses on the decision hyperplane. This is advantageous for high-dimensional spaces, since it requires fewer parameters than explicitly modeling $p(\phi|C_k)$ and $p(C_k)$.

Recap: Logistic Regression. Let's consider a data set $\{\phi_n, t_n\}$ with $n = 1, \dots, N$, where $\phi_n = \phi(\mathbf{x}_n)$, $t_n \in \{0, 1\}$, and $\mathbf{t} = (t_1, \dots, t_N)^T$. With $y_n = p(C_1|\phi_n)$, we can write the likelihood as $p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$. Define the error function as the negative log-likelihood, $E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}$. This is the so-called cross-entropy error function.

Recap: Gradient of the Error Function. Gradient for logistic regression: $\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\,\phi_n$. This is the same result as for the delta (= LMS) rule, $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,(y_k(\mathbf{x}_n; \mathbf{w}) - t_{kn})\,\phi_j(\mathbf{x}_n)$. We can use this to derive a sequential estimation algorithm; however, this will be quite slow. It is more efficient to use the 2nd-order Newton-Raphson scheme, which leads to IRLS (iteratively reweighted least squares).
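
As a small illustration of the batch form of this gradient, the following sketch performs one gradient-descent step on the cross-entropy error (not the IRLS update mentioned above). The function names and the step size `eta` are assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_gradient_step(w, Phi, t, eta=0.1):
    """One batch gradient-descent step on the cross-entropy error E(w).

    w   : (M,) weight vector
    Phi : (N, M) design matrix, rows phi(x_n)
    t   : (N,) binary targets in {0, 1}
    """
    y = sigmoid(Phi @ w)       # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)     # nabla E(w) = sum_n (y_n - t_n) phi_n
    return w - eta * grad
```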

Recap: Softmax Regression. Multi-class generalization of logistic regression: in logistic regression we assumed binary labels $t_n \in \{0, 1\}$; softmax generalizes this to $K$ values in 1-of-K notation.
$$y(\mathbf{x}; \mathbf{w}) = \begin{bmatrix} P(y = 1|\mathbf{x}; \mathbf{w}) \\ P(y = 2|\mathbf{x}; \mathbf{w}) \\ \vdots \\ P(y = K|\mathbf{x}; \mathbf{w}) \end{bmatrix} = \frac{1}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^\top \mathbf{x})} \begin{bmatrix} \exp(\mathbf{w}_1^\top \mathbf{x}) \\ \exp(\mathbf{w}_2^\top \mathbf{x}) \\ \vdots \\ \exp(\mathbf{w}_K^\top \mathbf{x}) \end{bmatrix}$$
This uses the softmax function. Note: the resulting distribution is normalized.
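
A short sketch of the softmax function itself, assuming NumPy. The max-subtraction is a standard numerical-stability trick and not part of the slide.

```python
import numpy as np

def softmax(a):
    """Normalized exponentials over the last axis."""
    a = a - np.max(a, axis=-1, keepdims=True)   # stability trick (assumption, not on slide)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Example usage: W is a (K, D) matrix of stacked class weight vectors w_k, x a (D,) input.
# probs = softmax(W @ x) gives [P(y=1|x;w), ..., P(y=K|x;w)], which sums to 1.
```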

Recap: Softmax Regression Cost Function. Logistic regression: an alternative way of writing the cost function is
$$E(\mathbf{w}) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\} = -\sum_{n=1}^{N} \sum_{k=0}^{1} \{I(t_n = k) \ln P(y_n = k|\mathbf{x}_n; \mathbf{w})\}.$$
Softmax regression: generalization to $K$ classes using indicator functions,
$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \left[ I(t_n = k) \ln \frac{\exp(\mathbf{w}_k^\top \mathbf{x}_n)}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^\top \mathbf{x}_n)} \right] = -\sum_{n=1}^{N} \sum_{k=1}^{K} \left[ I(t_n = k) \ln P(y_n = k|\mathbf{x}_n; \mathbf{w}) \right],$$
with gradient $\nabla_{\mathbf{w}_k} E(\mathbf{w}) = -\sum_{n=1}^{N} \left[ I(t_n = k) - P(y_n = k|\mathbf{x}_n; \mathbf{w}) \right] \mathbf{x}_n$.
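
A compact sketch that evaluates this cost and its gradient with NumPy. The variable names and conventions (class weight vectors stacked as rows of W, integer labels t) are choices made for the example.

```python
import numpy as np

def softmax_cost_and_grad(W, X, t):
    """Cross-entropy cost E(w) and its gradient for softmax regression.

    W : (K, D) stacked class weight vectors w_k
    X : (N, D) inputs
    t : (N,) integer class labels in {0, ..., K-1}
    """
    N = X.shape[0]
    A = X @ W.T                                            # scores w_k^T x_n, shape (N, K)
    A = A - A.max(axis=1, keepdims=True)                   # numerical stability
    P = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)   # P(y_n = k | x_n; w)
    E = -np.log(P[np.arange(N), t]).sum()                  # -sum_n sum_k I(t_n=k) ln P(y_n=k|x_n;w)
    T = np.zeros_like(P)
    T[np.arange(N), t] = 1.0                               # indicator I(t_n = k)
    grad = (P - T).T @ X                                   # row k: sum_n (P_nk - I(t_n=k)) x_n
    return E, grad
```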

Side Note: Support Vector Machine (SVM). Basic idea: the SVM tries to find a classifier which maximizes the margin between positive and negative data points. Up to now we consider linear classifiers, $\mathbf{w}^T \mathbf{x} + b = 0$. Formulation as a convex optimization problem: find the hyperplane satisfying $\arg\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2$ under the constraints $t_n (\mathbf{w}^T \mathbf{x}_n + b) \geq 1 \;\; \forall n$, based on training data points $\mathbf{x}_n$ and target values $t_n \in \{-1, 1\}$.

SVM Analysis. Traditional soft-margin formulation: $\min_{\mathbf{w} \in \mathbb{R}^D,\, \xi_n \in \mathbb{R}^+} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \xi_n$ subject to the margin constraints; the first term maximizes the margin, the second encourages most points to be on the correct side of the margin. A different way of looking at it: we can reformulate the constraints into the objective function, $\min_{\mathbf{w} \in \mathbb{R}^D} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{n=1}^{N} [1 - t_n y(\mathbf{x}_n)]_+$, where $[x]_+ := \max\{0, x\}$, i.e. an $L_2$ regularizer plus the hinge loss. Slide adapted from Christoph Lampert

SVM Error Function (Loss Function). [Figure: the ideal misclassification error, the squared error, and the hinge error plotted against $z_n$; the hinge error is robust to outliers, favors sparse solutions, but is not differentiable. Image source: Bishop, 2006.] The hinge error is used in SVMs: zero error for points outside the margin ($z_n > 1$) and linearly increasing error for misclassified points ($z_n < 1$). It leads to sparse solutions and is not sensitive to outliers, but it is not differentiable around $z_n = 1$ and therefore cannot be optimized directly.

SVM Discussion. The SVM optimization function is $\min_{\mathbf{w} \in \mathbb{R}^D} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{n=1}^{N} [1 - t_n y(\mathbf{x}_n)]_+$, an $L_2$ regularizer plus the hinge loss. The hinge loss enforces sparsity: only a subset of the training data points actually influences the decision boundary. This is different from the sparsity obtained through the regularizer, where only a subset of the input dimensions is used. The result is an unconstrained optimization problem, but with a non-differentiable objective function. It can be solved, e.g., by subgradient descent; currently most efficient: stochastic gradient descent. Slide adapted from Christoph Lampert
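
As a rough sketch of the subgradient approach mentioned above, the following stochastic subgradient descent minimizes the unconstrained hinge-loss objective. The per-sample scaling of the regularizer, the learning rate, and the omission of the bias term are simplifying assumptions made for illustration, not the lecture's recipe.

```python
import numpy as np

def svm_subgradient_sgd(X, t, C=1.0, eta=0.01, epochs=20, seed=0):
    """Stochastic subgradient descent on 1/2 ||w||^2 + C sum_n [1 - t_n w^T x_n]_+.

    X : (N, D) inputs, t : (N,) labels in {-1, +1}.
    The bias is omitted; append a constant feature to X to include one.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        for n in rng.permutation(N):
            margin = t[n] * (w @ X[n])
            # per-sample subgradient of (1/(2N))||w||^2 + C [1 - t_n w^T x_n]_+
            g = w / N - (C * t[n] * X[n] if margin < 1 else 0.0)
            w = w - eta * g
    return w
```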

Topics of This Lecture: A Short History of Neural Networks; Perceptrons (definition, loss functions, regularization, limits); Multi-Layer Perceptrons (definition, learning).

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron, and a cool learning algorithm: Perceptron Learning. Hardware implementation: the Mark I Perceptron for 20×20 pixel image analysis. "The embryo of an electronic computer that [...] will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." Image source: Wikipedia, clipartpanda.com

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert. They showed that (single-layer) Perceptrons cannot solve all problems; this was misunderstood by many as meaning that they were worthless. "Neural Networks don't work!" Image source: colourbox.de, thinkstock

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert. 1980s: Resurgence of Neural Networks, with some notable successes with multi-layer perceptrons and the backpropagation learning algorithm. "Oh no! Killer robots will achieve world domination!" "OMG! They work like the human brain!" Image sources: clipartpanda.com, cliparts.co

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert. 1980s: Resurgence of Neural Networks, with some notable successes with multi-layer perceptrons and the backpropagation learning algorithm. But they are hard to train, tend to overfit, and have unintuitive parameters, so the excitement fades again. "Sigh!" Image source: clipartof.com, colourbox.de

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert. 1980s: Resurgence of Neural Networks. 1995+: Interest shifts to other learning methods, notably Support Vector Machines; Machine Learning becomes a discipline of its own. "I can do research, me!" Image source: clipartof.com

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert. 1980s: Resurgence of Neural Networks. 1995+: Interest shifts to other learning methods, notably Support Vector Machines; Machine Learning becomes a discipline of its own. The general public and the press still love Neural Networks. "I'm doing Machine Learning." "So, you're using Neural Networks?" "Actually..."

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert. 1980s: Resurgence of Neural Networks. 1995+: Interest shifts to other learning methods. 2005+: Gradual progress: better understanding of how to successfully train deep networks, availability of large datasets and powerful GPUs. Still largely under the radar for many disciplines applying ML. "Are you using Neural Networks? Come on. Get real!"

A Brief History of Neural Networks. 1957: Rosenblatt invents the Perceptron. 1969: Minsky & Papert. 1980s: Resurgence of Neural Networks. 1995+: Interest shifts to other learning methods. 2005+: Gradual progress. 2012: Breakthrough results in the ImageNet Large Scale Visual Recognition Challenge, where a ConvNet halves the error rate of dedicated vision approaches. Deep Learning is widely adopted. "It works!" Image source: clipartpanda.com, clipartof.com

Topics of This Lecture: A Short History of Neural Networks; Perceptrons (definition, loss functions, regularization, limits); Multi-Layer Perceptrons (definition, learning).

Perceptrons (Rosenblatt 1957). Standard Perceptron: an input layer of hand-designed features based on common sense, connected by weights to an output layer with linear or logistic outputs. [Figure: single-layer network diagram.] Learning = determining the weights w.

Extension: Multi-Class Networks. One output node per class. [Figure: single-layer network with an input layer, weights, and an output layer with linear or logistic outputs.] This can be used to do multi-dimensional linear regression or multi-class classification. Slide adapted from Stefan Roth

Extension: Non-Linear Basis Functions. Straightforward generalization: a fixed mapping turns the input layer into a feature layer, which is connected by weights to the output layer with linear or logistic outputs. [Figure: network diagram with a fixed feature mapping.]

Extension: Non-Linear Basis Functions. Remarks: Perceptrons are generalized linear discriminants! Everything we know about the latter can also be applied here. Note: the feature functions $\phi(\mathbf{x})$ are kept fixed, not learned!

Perceptron Learning. A very simple algorithm: process the training cases in some permutation. If the output unit is correct, leave the weights alone. If the output unit incorrectly outputs a zero, add the input vector to the weight vector. If the output unit incorrectly outputs a one, subtract the input vector from the weight vector. This is guaranteed to converge to a correct solution if such a solution exists. Slide adapted from Geoff Hinton

Perceptron Learning. Let's analyze this algorithm: process the training cases in some permutation. If the output unit is correct, leave the weights alone. If the output unit incorrectly outputs a zero, add the input vector to the weight vector. If the output unit incorrectly outputs a one, subtract the input vector from the weight vector. Translation: $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,(y_k(\mathbf{x}_n; \mathbf{w}) - t_{kn})\,\phi_j(\mathbf{x}_n)$. Slide adapted from Geoff Hinton

Perceptron Learning. Continuing the analysis: the update rule above, $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\,(y_k(\mathbf{x}_n; \mathbf{w}) - t_{kn})\,\phi_j(\mathbf{x}_n)$, is the delta rule, a.k.a. the LMS rule! Perceptron Learning corresponds to 1st-order (stochastic) gradient descent of a quadratic error function! Slide adapted from Geoff Hinton
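
For concreteness, here is a minimal sketch of the classic perceptron rule as described two slides above (threshold output in {0, 1}; add or subtract the misclassified input). The function name and the stopping criterion are choices made for this example.

```python
import numpy as np

def perceptron_train(Phi, t, max_epochs=100):
    """Rosenblatt's perceptron learning rule on fixed features phi(x_n).

    Phi : (N, M) feature vectors (include a constant 1 column to model the bias)
    t   : (N,) targets in {0, 1}
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            y_n = 1 if w @ phi_n > 0 else 0
            if y_n == t_n:
                continue             # correct: leave the weights alone
            if t_n == 1:
                w = w + phi_n        # incorrectly output a zero: add the input vector
            else:
                w = w - phi_n        # incorrectly output a one: subtract the input vector
            errors += 1
        if errors == 0:              # converged (possible if the data are separable)
            break
    return w
```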

Loss Functions. We can now also apply other loss functions: the L2 loss (least-squares regression), the L1 loss (median regression), the cross-entropy loss (logistic regression), the hinge loss (SVM classification), and the softmax loss for multi-class probabilistic classification, $L(t, y(\mathbf{x})) = -\sum_n \sum_k \left\{ I(t_n = k) \ln \frac{\exp(y_k(\mathbf{x}_n))}{\sum_j \exp(y_j(\mathbf{x}_n))} \right\}$.
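
The following sketch writes these losses out in NumPy so the differences are easy to see; the function names and argument conventions are assumptions made for the example, not the lecture's notation.

```python
import numpy as np

def l2_loss(y, t):             # least-squares regression
    return np.sum((y - t) ** 2)

def l1_loss(y, t):             # median regression
    return np.sum(np.abs(y - t))

def cross_entropy_loss(y, t):  # logistic regression: y in (0, 1), t in {0, 1}
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def hinge_loss(y, t):          # SVM classification: t in {-1, +1}
    return np.sum(np.maximum(0.0, 1.0 - t * y))

def softmax_loss(Y, t):        # multi-class: Y is (N, K) scores y_k(x_n), t integer labels
    Y = Y - Y.max(axis=1, keepdims=True)
    log_p = Y - np.log(np.exp(Y).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(t)), t].sum()
```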

Regularization. In addition, we can apply regularizers, e.g. an L2 regularizer; this is known as weight decay in Neural Networks. We can also apply other regularizers, e.g. an L1 regularizer, which promotes sparsity. Since Neural Networks often have many parameters, regularization becomes very important in practice. We will see more complex regularization techniques later on...
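
A tiny sketch of how such penalties enter the gradient, assuming NumPy arrays; the helper names and the regularization strength `lam` are illustrative assumptions.

```python
import numpy as np

def l2_regularized_grad(data_grad, w, lam=1e-4):
    """Gradient of E(w) + lam * ||w||^2: 'weight decay' adds 2*lam*w to the data gradient."""
    return data_grad + 2.0 * lam * w

def l1_regularized_grad(data_grad, w, lam=1e-4):
    """Subgradient of E(w) + lam * ||w||_1: pushes individual weights towards zero (sparsity)."""
    return data_grad + lam * np.sign(w)
```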

Limitations of Perceptrons. What makes the task difficult? Perceptrons with fixed, hand-coded input features can model any separable function perfectly... given the right input features. For some tasks this requires an exponential number of input features, e.g. by enumerating all possible binary input vectors as separate feature units (similar to a look-up table). But this approach won't generalize to unseen test cases! It is the feature design that solves the task! Once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn. Classic example: the XOR function.

Wait... Didn't we just say that Perceptrons correspond to generalized linear discriminants, and that Perceptrons are very limited? Doesn't this mean that what we have been doing so far in this lecture has the same problems? Yes, this is the case. A linear classifier cannot solve certain problems (e.g., XOR). However, with a non-linear classifier based on the right kind of features, the problem becomes solvable. So far, we have solved such problems by hand-designing good features $\phi$ and kernels $\phi^\top \phi$. Can we also learn such feature representations?
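
A small worked example of the XOR case mentioned above: a linear discriminant on the raw inputs cannot reproduce the XOR labels, but adding one hand-designed product feature makes them linearly separable. The specific weight vector below is just one valid choice picked for illustration.

```python
import numpy as np

# XOR targets are not linearly separable in the raw inputs (x1, x2)...
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0])

# ...but with one hand-designed non-linear feature, phi(x) = x1 * x2,
# a linear discriminant on [1, x1, x2, x1*x2] separates the classes:
Phi = np.column_stack([np.ones(4), X, X[:, 0] * X[:, 1]])
w = np.array([-0.5, 1.0, 1.0, -2.0])       # one possible separating weight vector
print((Phi @ w > 0).astype(int))           # -> [0 1 1 0], matching t
```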

Topics of This Lecture: A Short History of Neural Networks; Perceptrons (definition, loss functions, regularization, limits); Multi-Layer Perceptrons (definition, learning).

Multi-Layer Perceptrons. Adding more layers: [Figure: network with an input layer, a hidden layer, and an output layer.] Slide adapted from Stefan Roth

Multi-Layer Perceptrons. Activation functions $g^{(k)}$: for example, $g^{(2)}(a) = \sigma(a)$ and $g^{(1)}(a) = a$. The hidden layer can have an arbitrary number of nodes, and there can also be multiple hidden layers. Universal approximators: a 2-layer network (1 hidden layer) can approximate any continuous function on a compact domain arbitrarily well, assuming a sufficient number of hidden nodes. Slide credit: Stefan Roth
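
A minimal sketch of the corresponding forward pass, assuming NumPy. Here tanh is used for the hidden layer and a logistic sigmoid for the output as one common choice of activations; the weight and bias names are assumptions for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, W2, b2, g1=np.tanh, g2=sigmoid):
    """Forward pass of a 2-layer network (one hidden layer).

    x  : (D,) input
    W1 : (H, D), b1 : (H,)  hidden-layer weights and biases
    W2 : (K, H), b2 : (K,)  output-layer weights and biases
    g1, g2 : hidden and output activation functions
    """
    z = g1(W1 @ x + b1)    # hidden-unit activations
    y = g2(W2 @ z + b2)    # network outputs y_k(x; w)
    return y, z
```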

Learning with Hidden Units. Networks without hidden units are very limited in what they can learn: more layers of linear units do not help, since the result is still linear, and fixed output non-linearities are not enough. We need multiple layers of adaptive non-linear hidden units. But how can we train such nets? We need an efficient way of adapting all weights, not just the last layer. Learning the weights to the hidden units = learning features. This is difficult, because nobody tells us what the hidden units should do. Next lecture! Slide adapted from Geoff Hinton

References and Further Reading. More information on Neural Networks can be found in Chapters 6 and 7 of the Goodfellow & Bengio book: Ian Goodfellow, Aaron Courville, Yoshua Bengio, Deep Learning, MIT Press, in preparation. https://goodfeli.github.io/dlbook/