Advanced Machine Learning, Lecture 12: Neural Networks (Bastian Leibe, RWTH Aachen)

Advanced Machine Learning, Lecture 12: Neural Networks
24.11.2016, Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de

Talk Announcement
Yann LeCun (NYU & Facebook AI), 28.11., 15:00-16:30h, SuperC 6th floor (Ford Saal).
The rapid progress of AI in the last few years is largely the result of advances in deep learning and neural nets, combined with the availability of large datasets and fast GPUs. We now have systems that can recognize images with an accuracy that rivals that of humans. This will lead to revolutions in several domains such as autonomous transportation and medical image analysis. But all of these systems currently use supervised learning, in which the machine is trained with inputs labeled by humans. The challenge of the next several years is to let machines learn from raw, unlabeled data, such as video or text. This is known as predictive (or unsupervised) learning. Intelligent systems today do not possess "common sense", which humans and animals acquire by observing the world, by acting in it, and by understanding its physical constraints. I will argue that the ability of machines to learn predictive models of the world is a key component that will enable significant progress in AI. The main technical difficulty is that the world is only partially predictable. A general formulation of unsupervised learning that deals with partial predictability will be presented. The formulation connects many well-known approaches to unsupervised learning, as well as new and exciting ones such as adversarial training.
No lecture next Monday - go see the talk!

This Lecture: Advanced Machine Learning
Regression Approaches: Linear Regression, Regularization (Ridge, Lasso), Kernels (Kernel Ridge Regression), Gaussian Processes.
Approximate Inference: Sampling Approaches, MCMC.
Deep Learning: Linear Discriminants, Neural Networks, Backpropagation, CNNs, RNNs, ResNets, etc.

Recap: Generalized Linear Discriminants
Extension with non-linear basis functions: transform the vector x with M nonlinear basis functions φ_j(x),

    y_k(x) = g( Σ_{j=0}^{M} w_kj φ_j(x) ),   with φ_0(x) = 1.

The basis functions φ_j(x) allow non-linear decision boundaries, and the activation function g(·) bounds the influence of outliers. Disadvantage: the minimization is no longer available in closed form. (Slide adapted from Bernt Schiele)

Recap: Gradient Descent
Iterative minimization: start with an initial guess for the parameter values w^(0) and move towards a (local) minimum by following the gradient. Basic strategies:

    Batch learning:       w_kj^(τ+1) = w_kj^(τ) - η ∂E(w)/∂w_kj |_{w^(τ)}
    Sequential updating:  w_kj^(τ+1) = w_kj^(τ) - η ∂E_n(w)/∂w_kj |_{w^(τ)},   where E(w) = Σ_n E_n(w).

Example: for the quadratic error function E(w) = Σ_n (y(x_n; w) - t_n)^2, sequential updating leads to the delta rule (= LMS rule),

    w_kj^(τ+1) = w_kj^(τ) - η (y_k(x_n; w) - t_kn) φ_j(x_n) = w_kj^(τ) - η δ_kn φ_j(x_n),   with δ_kn = y_k(x_n; w) - t_kn.

Simply feed back the input data point, weighted by the classification error. (Slide adapted from Bernt Schiele)
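To make the sequential update concrete, here is a minimal NumPy sketch of the delta rule for a generalized linear discriminant with fixed basis functions (the function and variable names and the learning rate are my own choices, not from the slides):

```python
import numpy as np

def delta_rule_epoch(W, Phi, T, eta=0.01):
    """One sequential (stochastic) pass of the delta / LMS rule.

    W   : (K, M) weight matrix, one row per output k
    Phi : (N, M) design matrix, Phi[n, j] = phi_j(x_n), with phi_0(x) = 1
    T   : (N, K) target values t_kn
    eta : learning rate
    """
    for n in range(Phi.shape[0]):
        y = W @ Phi[n]                        # linear outputs y_k(x_n; w)
        delta = y - T[n]                      # delta_kn = y_k(x_n; w) - t_kn
        W -= eta * np.outer(delta, Phi[n])    # w_kj <- w_kj - eta * delta_kn * phi_j(x_n)
    return W
```

In practice one would reshuffle the data between passes and monitor the error to decide when to stop.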

Recap: Probabilistic Discriminative Models
Consider models of the form

    p(C_1 | φ) = y(φ) = σ(w^T φ),   with p(C_2 | φ) = 1 - p(C_1 | φ).

This model is called logistic regression. Properties: it has a probabilistic interpretation, but it is a discriminative method that focuses only on the decision hyperplane. This is advantageous for high-dimensional spaces, since it requires fewer parameters than explicitly modeling p(φ | C_k) and p(C_k).

Recap: Logistic Regression
Consider a data set {φ_n, t_n} with n = 1, ..., N, where φ_n = φ(x_n), t_n ∈ {0, 1}, and t = (t_1, ..., t_N)^T. With y_n = p(C_1 | φ_n), we can write the likelihood as

    p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}.

Define the error function as the negative log-likelihood,

    E(w) = -ln p(t | w) = -Σ_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }.

This is the so-called cross-entropy error function.

Softmax Regression
Multi-class generalization of logistic regression: in logistic regression we assumed binary labels t_n ∈ {0, 1}; softmax generalizes this to K values in 1-of-K notation,

    y(x; w) = [ P(y = 1 | x; w), ..., P(y = K | x; w) ]^T
            = 1 / ( Σ_{j=1}^{K} exp(w_j^T x) ) · [ exp(w_1^T x), ..., exp(w_K^T x) ]^T.

This uses the softmax function; note that the resulting distribution is normalized.

Softmax Regression Cost Function
For logistic regression, the cross-entropy error can equivalently be written with indicator functions,

    E(w) = -Σ_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) } = -Σ_n Σ_{k=0}^{1} I(t_n = k) ln P(y_n = k | x_n; w).

Softmax regression generalizes this to K classes,

    E(w) = -Σ_n Σ_{k=1}^{K} I(t_n = k) ln [ exp(w_k^T x_n) / Σ_{j=1}^{K} exp(w_j^T x_n) ].

Optimization
Again, no closed-form solution is available, so we resort to gradient descent. The gradient is

    ∇_{w_k} E(w) = -Σ_n ( I(t_n = k) - P(y_n = k | x_n; w) ) x_n.

Note that ∇_{w_k} E(w) is itself a vector of partial derivatives for the different components of w_k. We can now plug this into a standard optimization package.

A Note on Error Functions
The ideal misclassification error function, plotted as a function of z_n = t_n y(x_n), is what we would like to approximate. Unfortunately, it is not differentiable: the gradient is zero for misclassified points, so we cannot minimize it by gradient descent. (Image source: Bishop, 2006)
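The softmax cost and its gradient are easy to compute in vectorized form. A small NumPy sketch (my own function and variable names; encoding the labels as integers is an assumption made for this example):

```python
import numpy as np

def softmax_cost_grad(W, X, T):
    """Cross-entropy cost and gradient for softmax regression.

    W : (K, D) weight matrix, one weight vector w_k per class
    X : (N, D) inputs x_n
    T : (N,)   integer class labels in {0, ..., K-1}
    Returns (E, grad), where grad[k] = -sum_n (I(t_n = k) - P(y_n = k | x_n; W)) x_n.
    """
    scores = X @ W.T                               # scores[n, k] = w_k^T x_n
    scores -= scores.max(axis=1, keepdims=True)    # for numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)              # P[n, k] = P(y_n = k | x_n; W)
    N = X.shape[0]
    E = -np.log(P[np.arange(N), T]).sum()          # E(w) = -sum_n ln P(y_n = t_n | x_n; W)
    Y = np.zeros_like(P)
    Y[np.arange(N), T] = 1.0                       # 1-of-K indicators I(t_n = k)
    grad = -(Y - P).T @ X                          # (K, D) gradient
    return E, grad
```

Both outputs can be handed to a standard optimization package, as the slide suggests.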

A Note on Error Functions (cont'd)
Squared error, plotted against z_n = t_n y(x_n), is sensitive to outliers and penalizes "too correct" data points. It is used in least-squares classification: very popular, since it leads to closed-form solutions, but sensitive to outliers due to the squared penalty, and it generally does not lead to good classifiers. (Image source: Bishop, 2006)

The cross-entropy error is robust to outliers. Its minimizer is given by the posterior class probabilities, and the error function is convex, so a unique minimum exists. The error increases only roughly linearly for badly misclassified points, but there is no closed-form solution, so iterative estimation is required. (Image source: Bishop, 2006)

Side Note: Support Vector Machine (SVM)
Basic idea: the SVM tries to find the classifier which maximizes the margin between positive and negative data points. Up to now we consider linear classifiers, w^T x + b = 0. This can be formulated as a convex optimization problem: find the hyperplane satisfying

    argmin_{w, b} (1/2) ‖w‖^2   under the constraints   t_n (w^T x_n + b) ≥ 1   for all n,

based on training data points x_n and target values t_n ∈ {-1, 1}.

SVM Analysis
Traditional soft-margin formulation:

    min_{w ∈ R^D, ξ_n ∈ R^+}  (1/2) ‖w‖^2 + C Σ_n ξ_n   subject to   t_n y(x_n) ≥ 1 - ξ_n.

A different way of looking at it: we can reformulate the constraints into the objective function,

    min_{w ∈ R^D}  (1/2) ‖w‖^2 + C Σ_n [1 - t_n y(x_n)]_+ ,   where [x]_+ := max{0, x}.

The L2 regularizer maximizes the margin; the hinge loss demands that most points should be on the correct side of the margin. (Slide adapted from Christoph Lampert)

SVM Error Function and Discussion
The hinge error, again plotted against z_n = t_n y(x_n), is robust to outliers and favors sparse solutions: it gives zero error for points outside the margin (z_n > 1) and a linearly increasing error for misclassified points (z_n < 1). However, it is not differentiable around z_n = 1, so it cannot be optimized directly. (Image source: Bishop, 2006)

The hinge loss enforces sparsity: only a subset of the training data points actually influences the decision boundary. This is different from the sparsity obtained through a regularizer such as the L1 penalty, where only a subset of the input dimensions is used. The reformulated problem is an unconstrained optimization of a non-differentiable function; it can be solved, e.g., by subgradient descent, and currently the most efficient approach is stochastic gradient descent. (Slide adapted from Christoph Lampert)
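A minimal sketch of stochastic subgradient descent on the unconstrained hinge-loss objective above (the step size, epoch count, and the even per-sample split of the regularizer are my own choices for illustration, not the lecture's recipe):

```python
import numpy as np

def svm_sgd(X, t, C=1.0, eta=1e-3, epochs=20, seed=0):
    """Stochastic subgradient descent for
    (1/2)||w||^2 + C * sum_n max(0, 1 - t_n (w^T x_n + b)).

    X : (N, D) inputs, t : (N,) labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        for n in rng.permutation(N):
            margin = t[n] * (X[n] @ w + b)
            grad_w = w / N                   # regularizer split evenly over the N samples
            if margin < 1:                   # inside the margin: hinge subgradient is -t_n x_n
                grad_w -= C * t[n] * X[n]
                b += eta * C * t[n]
            w -= eta * grad_w
    return w, b
```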

Topics of This Lecture
A Short History of Neural Networks. Perceptrons: loss functions, regularization, limits. Multi-Layer Perceptrons: learning.

A Short History of Neural Networks
Rosenblatt's Perceptron came with a cool learning algorithm (Perceptron Learning) and a hardware implementation, the Mark I Perceptron, built for 20x20 pixel image analysis. It was announced to the public as "the embryo of an electronic computer that [...] will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." (Image source: Wikipedia, clipartpanda.com)

Minsky and Papert then showed that (single-layer) Perceptrons cannot solve all problems. This was misunderstood by many as meaning that Perceptrons were worthless ("Neural Networks don't work!").

Later came some notable successes with multi-layer perceptrons and the backpropagation learning algorithm, along with considerable hype ("OMG! They work like the human brain!", "Oh no! Killer robots will achieve world domination!"). (Image sources: colourbox.de, thinkstock, clipartpanda.com, cliparts.co) But multi-layer perceptrons were hard to train, tended to overfit, and had unintuitive parameters, so the excitement faded again.

1995+: Interest shifts to other learning methods, notably Support Vector Machines, and Machine Learning becomes a discipline of its own ("I can do science, me!"). (Image sources: clipartof.com, colourbox.de)

1995+ (cont'd): The general public and the press still love Neural Networks ("I'm doing Machine Learning." - "So, you're using Neural Networks?").

2005+: Gradual progress. Better understanding of how to successfully train deep networks; availability of large datasets and powerful GPUs. Still largely under the radar for many disciplines applying ML ("Are you using Neural Networks?" - "Actually..." - "Come on. Get real!").

2012: Breakthrough results. On the ImageNet Large Scale Visual Recognition Challenge, a ConvNet halves the error rate of dedicated vision approaches. Deep Learning is widely adopted. It works! (Image sources: clipartpanda.com, clipartof.com)

Perceptrons (Rosenblatt 1957)
Standard Perceptron: an input layer of hand-designed features chosen based on common sense, a layer of weights, and output nodes. The outputs can be linear, y_k(x) = Σ_d w_kd x_d, or logistic, y_k(x) = σ(Σ_d w_kd x_d). Learning = determining the weights w.

Extension: Multi-Class Networks. With one output node per class, the same architecture can be used for multidimensional linear regression or multiclass classification. (Slides adapted from Stefan Roth)
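For concreteness, a tiny sketch of the forward pass of such a single-layer, multi-class perceptron (handling the bias via a constant first feature is my own convention for this example):

```python
import numpy as np

def perceptron_forward(W, x, logistic=False):
    """Single-layer, multi-class perceptron: one output node per class.

    W : (K, D) weights w_kd
    x : (D,)   hand-designed input features, with x[0] = 1 acting as bias
    Returns linear outputs a_k = sum_d w_kd * x_d, or logistic outputs
    sigma(a_k) if logistic=True.
    """
    a = W @ x
    return 1.0 / (1.0 + np.exp(-a)) if logistic else a
```

Linear outputs correspond to the regression use case, logistic outputs to classification.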

Extension: Non-Linear Basis Functions
Straightforward generalization: insert a fixed feature layer between the inputs and the weights W_kd, i.e. map the input through a fixed, non-learned mapping to feature values and apply the linear or logistic outputs on top. Remarks: such Perceptrons are generalized linear discriminants, so everything we know about the latter can also be applied here. Note that the feature functions φ(x) are kept fixed, not learned!

Perceptron Learning
A very simple algorithm: process the training cases in some permutation.
If the output unit is correct, leave the weights alone.
If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
If the output unit incorrectly outputs a one, subtract the input vector from the weight vector.
This is guaranteed to converge to a correct solution if such a solution exists. (Slide adapted from Geoff Hinton) A code sketch of this procedure follows below, after the loss-function summary.

Let's analyze this algorithm. Translated into an update equation, it reads

    w_kj^(τ+1) = w_kj^(τ) - η (y_k(x_n; w) - t_kn) φ_j(x_n).

This is the delta rule, a.k.a. the LMS rule! Perceptron learning therefore corresponds to 1st-order (stochastic) gradient descent on a quadratic error function. (Slide adapted from Geoff Hinton)

Loss Functions
We can now also apply other loss functions L(t, y(x)):
L2 loss: least-squares regression.
L1 loss: median regression.
Cross-entropy loss: logistic regression.
Hinge loss: SVM classification.
Softmax loss: multi-class probabilistic classification,

    L(t, y(x)) = -Σ_n Σ_k I(t_n = k) ln [ exp(y_k(x_n)) / Σ_j exp(y_j(x_n)) ].
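As referenced above, here is a minimal NumPy sketch of the classic perceptron learning procedure for a single binary output unit (the zero initialization, threshold convention, and stopping rule are my own choices):

```python
import numpy as np

def perceptron_learning(Phi, t, epochs=100, seed=0):
    """Classic perceptron learning rule on fixed feature vectors.

    Phi : (N, M) feature vectors phi(x_n), with phi_0 = 1 as bias feature
    t   : (N,)   targets in {0, 1}
    Converges to a separating weight vector if one exists.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for n in rng.permutation(len(t)):          # process cases in some permutation
            y = 1 if Phi[n] @ w >= 0 else 0
            if y == t[n]:
                continue                           # correct output: leave weights alone
            w += Phi[n] if t[n] == 1 else -Phi[n]  # wrongly 0: add input; wrongly 1: subtract
            mistakes += 1
        if mistakes == 0:                          # all cases correct: a solution was found
            break
    return w
```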

Regularization
In addition, we can apply regularizers, e.g. an L2 regularizer that adds a penalty term such as (λ/2) ‖w‖^2 to the error function. This is known as weight decay in Neural Networks. We can also apply other regularizers, e.g. an L1 penalty to encourage sparsity. Since Neural Networks often have many parameters, regularization becomes very important in practice; we will see more complex regularization techniques later on.

Limitations of Perceptrons
What makes a task difficult? Perceptrons with fixed, hand-coded input features can model any separable function perfectly... given the right input features. For some tasks this requires an exponential number of input features, e.g. enumerating all possible binary input vectors as separate feature units (similar to a look-up table), and such an approach will not generalize to unseen test cases. It is the feature design that solves the task! Once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn. Classic example: the XOR function.

Wait...
Didn't we just say that Perceptrons correspond to generalized linear discriminants, and that Perceptrons are very limited? Doesn't this mean that what we have been doing so far in this lecture has the same problems? Yes, this is the case. A linear classifier cannot solve certain problems (e.g., XOR). However, with a non-linear classifier based on the right kind of features, the problem becomes solvable. So far, we have solved such problems by hand-designing good features φ and kernels φ^T φ. Can we also learn such feature representations?

Multi-Layer Perceptrons
Adding more layers: a hidden layer is inserted between the inputs and the output, with an activation function g^(k) per layer, for example g^(2)(a) = σ(a) and g^(1)(a) = a. The hidden layer can have an arbitrary number of nodes, and there can also be multiple hidden layers. Universal approximators: a 2-layer network (1 hidden layer) can approximate any continuous function on a compact domain arbitrarily well, assuming sufficiently many hidden nodes. (Slides adapted from Stefan Roth) A small worked XOR example is sketched below.
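To illustrate how a single hidden layer removes the XOR limitation discussed above, here is a 2-layer network with hand-set (not learned) weights; using logistic activations in both layers is my choice for this sketch rather than the lecture's notation example:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_layer_net(x, W1, b1, W2, b2):
    """Forward pass of a 2-layer network (one hidden layer)."""
    h = sigmoid(W1 @ x + b1)      # hidden layer
    return sigmoid(W2 @ h + b2)   # output layer

# Hand-set weights that realize XOR, which no single-layer perceptron
# on the raw inputs can represent:
W1 = np.array([[20.0, 20.0],      # h1 ~ OR(x1, x2)
               [20.0, 20.0]])     # h2 ~ AND(x1, x2)
b1 = np.array([-10.0, -30.0])
W2 = np.array([[20.0, -20.0]])    # y ~ h1 AND (NOT h2) = XOR(x1, x2)
b2 = np.array([-10.0])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    y = two_layer_net(np.array(x, dtype=float), W1, b1, W2, b2)
    print(x, "->", round(float(y[0])))
```

The point of the following slides, however, is that we do not want to hand-set such weights; we want to learn them, which requires training the hidden units.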

Learning with Hidden Units
Networks without hidden units are very limited in what they can learn. More layers of linear units do not help: the result is still linear. Fixed output non-linearities are not enough either; we need multiple layers of adaptive non-linear hidden units. But how can we train such nets? We need an efficient way of adapting all of the weights, not just those of the last layer. Learning the weights to the hidden units amounts to learning features, and this is difficult because nobody tells us what the hidden units should do. This is the topic of the next lecture. (Slide adapted from Geoff Hinton)

References and Further Reading
More information on Neural Networks can be found in Chapters 6 and 7 of the Deep Learning book:
Ian Goodfellow, Yoshua Bengio, Aaron Courville: Deep Learning. MIT Press, in preparation. https://goodfeli.github.io/dlbook/