Advanced Machine Learning
Lecture 1: Introduction
20.10.2015
Bastian Leibe, Visual Computing Institute, RWTH Aachen University
http://www.vision.rwth-aachen.de/ - leibe@vision.rwth-aachen.de

Organization
- Lecturer: Prof. Bastian Leibe (leibe@vision.rwth-aachen.de)
- Teaching Assistants: Umer Rafi (rafi@vision.rwth-aachen.de), Lucas Beyer (beyer@vision.rwth-aachen.de)
- Course webpage: http://www.vision.rwth-aachen.de/teaching/
- Slides will be made available on the webpage. There is also an L2P electronic repository.
- Please subscribe to the lecture on the Campus system! This is important to get email announcements and L2P access!

Language
- The official course language will be English: if at least one English-speaking student is present; if not, you can choose.
- However, please tell me when I'm talking too fast or when I should repeat something in German for better understanding! You may at any time ask questions in German.
- You may turn in your exercises in German. You may take the oral exam in German.

Relationship to Previous Courses
- Lecture Machine Learning (past summer semester): introduction to ML, classification, graphical models.
- This course, Advanced Machine Learning, is the natural continuation of the ML course and takes a deeper look at the underlying concepts. But we will try to make it accessible also to newcomers. Quick poll: who hasn't heard the ML lecture?
- This year: lots of new material, including a large lecture block on Deep Learning. It is the first time for us to teach this (so, bear with us...).

New Content This Year
- Deep Learning

Organization
- Structure: 3V (lecture) + 1Ü (exercises), 6 ECTS credits, part of the area "Applied Computer Science".
- Place & time: Lecture/Exercises Mon 14:15-15:45, room UMIC 025; Lecture/Exercises Thu 10:15-11:45, room UMIC 025.
- Exam: oral or written exam, depending on the number of participants. Towards the end of the semester, there will be a proposed date.

Course Webpage
- http://www.vision.rwth-aachen.de/teaching/

Exercises and Supplementary Material
- Exercises: typically 1 exercise sheet every 2 weeks, with pen & paper and programming exercises (Matlab for early topics, Theano for Deep Learning topics; Monday: Matlab tutorial).
- The exercises give hands-on experience with the algorithms from the lecture. Send your solutions the night before the exercise class.
- Supplementary material: research papers and book chapters, provided on the webpage.

Textbooks
- Most lecture topics will be covered in Bishop's book; some additional topics can be found in Rasmussen & Williams.
  Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (available in the library's "Handapparat").
  Carl E. Rasmussen, Christopher K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006 (also available online: http://www.gaussianprocess.org/gpml/).
- Research papers will be given out for some topics: tutorials and deeper introductions, application papers.

How to Find Us
- Office: UMIC Research Centre, Mies-van-der-Rohe-Strasse 15, room 124.
- Office hours: if you have questions about the lecture, come see us. My regular office hours will be announced. Send us an email beforehand to confirm a time slot.
- Questions are welcome!

Machine Learning
- Statistical Machine Learning: principles, methods, and algorithms for learning and prediction on the basis of past evidence.
- Already everywhere: speech recognition (e.g. speed-dialing), computer vision (e.g. face detection), hand-written character recognition (e.g. letter delivery), information retrieval (e.g. image & video indexing), operating systems (e.g. caching), fraud detection (e.g. credit cards), text filtering (e.g. email spam filters), game playing (e.g. strategy prediction), robotics (e.g. prediction of battery lifetime).

Automatic Speech Recognition
- (Application example slide.)

Application example slides: Computer Vision (object recognition, segmentation, scene understanding); Information Retrieval (retrieval, categorization, clustering, ...); Financial Prediction (time series analysis, ...); Medical Diagnosis (inference from partial observations; image from Kevin Murphy); Bioinformatics (modelling gene microarray data, ...); Robotics (DARPA Grand Challenge, ...; image from Kevin Murphy).

Machine Learning: Core Questions
- Learning to perform a task from experience.
- Task: can often be expressed through a mathematical function y = f(x; w), where x is the input, y the output, and w the parameters (this is what is "learned"). (A tiny illustration follows at the end of this overview.)
- Classification vs. regression: regression has continuous y; classification has discrete y (e.g. class membership, sometimes also the posterior probability).

Machine Learning: Core Questions
- y = f(x; w)
- w characterizes the family of functions; w indexes the space of hypotheses; w can be a vector, a connection matrix, a graph, ...

A Look Back: Lecture Machine Learning
- Fundamentals: Bayes Decision Theory, Probability Density Estimation
- Classification Approaches: Linear Discriminant Functions, Support Vector Machines, Ensemble Methods & Boosting, Randomized Trees, Forests & Ferns
- Generative Models: Bayesian Networks, Markov Random Fields

This Lecture: Advanced Machine Learning
- Extending the lecture Machine Learning from last semester.
- Regression Approaches: Linear Regression, Regularization (Ridge, Lasso), Gaussian Processes
- Learning with Latent Variables: EM and Generalizations, Approximate Inference
- Deep Learning: Neural Networks; CNNs, RNNs, RBMs, etc.

Let's Get Started
- Some of you already have basic ML background. Who hasn't? We'll start with a gentle introduction, and I'll try to make the lecture also accessible to newcomers.
- We'll review the main concepts before applying them. I'll point out chapters to review from the ML lecture whenever knowledge from there is needed or helpful.
- But please tell me when I'm moving too fast (or too slow)!

Topics of This Lecture
- Regression: Motivation - polynomial fitting, general least-squares regression, the overfitting problem, regularization, ridge regression
- Recap: Important Concepts from the ML Lecture - probability theory, Bayes decision theory, maximum likelihood estimation, Bayesian estimation
- A Probabilistic View on Regression - least-squares estimation as maximum likelihood
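To make the y = f(x; w) notation concrete, here is a tiny illustrative sketch (Python rather than the Matlab used in the exercises; the particular function and values are made-up assumptions): the same parametric function can produce a continuous output (regression) or be thresholded into a discrete one (classification).

```python
import numpy as np

# y = f(x; w): one parametric family serves both regression and classification.
def f(x, w):
    # w indexes one member of a family of linear functions of x.
    return np.dot(w, x)

x = np.array([1.0, 2.0])      # input
w = np.array([0.5, -0.3])     # parameters (what is "learned")
y_regression = f(x, w)        # continuous output y
y_classification = y_regression > 0  # discrete output, e.g. class membership
print(y_regression, y_classification)
```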

Regression
- Learning to predict a continuous function value.
- Given: a training set X = {x_1, ..., x_N} with target values T = {t_1, ..., t_N}.
- Learn a continuous function y(x) to predict the function value for a new input x.
- Steps towards a solution:
  1. Choose a form of the function y(x, w) with parameters w.
  2. Define an error function E(w) to optimize.
  3. Optimize E(w) for w to find a good solution. (This may involve math.)
  4. Derive the properties of this solution and think about its limitations.

Example: Polynomial Curve Fitting
- Toy dataset: generated by the function sin(2πx), with a small level of random noise with a Gaussian distribution added (blue dots).
- Goal: fit a polynomial function to this data, y(x, w) = w_0 + w_1 x + ... + w_M x^M.
- Note: this is a nonlinear function of x, but a linear function of the w_j.

Error Function
- How do we determine the values of the coefficients w? We need to define an error function to be minimized. This function specifies how a deviation from the target value should be weighted.
- Popular choice: the sum-of-squares error, E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) - t_n)².
- We'll discuss the motivation for this particular function later.

Minimizing the Error
- How do we minimize the error? Solution (always!): compute the derivative and set it to zero.
- Since the error is a quadratic function of w, its derivative will be linear in w, so the minimization has a unique solution.

Least-Squares Regression
- We are given training data points X = {x_1 ∈ R^d, ..., x_n} and associated function values T = {t_1 ∈ R, ..., t_n}.
- Start with a linear regressor and try to enforce y(x_i) = w^T x_i + w_0 ≈ t_i, i.e. one linear equation for each training data point / label pair.
- This is the same basic setup used for least-squares classification! Only the target values are now continuous.

Least-Squares Regression: Setup
- Step 1: Define the augmented vectors x~_i = (x_i^T, 1)^T and w~ = (w^T, w_0)^T.
- Step 2: Rewrite the constraints as w~^T x~_i ≈ t_i.
- Step 3: Collect them in matrix-vector notation.
- Step 4: Find the least-squares solution, e.g. via the pseudo-inverse (a sketch follows below).
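A minimal sketch of this least-squares setup (Python/NumPy for illustration, although the course exercises use Matlab; the variable names and the noise level are assumptions of this sketch, not the lecture's reference implementation). It builds the augmented inputs x~_i = (x_i, 1), stacks one equation per training pair, and solves for w~ in the least-squares sense.

```python
import numpy as np

# Toy dataset as described above: targets are sin(2*pi*x)
# plus a small amount of Gaussian noise (noise level assumed here).
rng = np.random.default_rng(0)
N = 10
x = rng.uniform(0.0, 1.0, size=N)                           # inputs x_1..x_N
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)    # targets t_1..t_N

# Step 1: augment each input with a constant 1 so that the bias w_0
# is absorbed into the weight vector (x_tilde = [x, 1], w_tilde = [w, w_0]).
X_tilde = np.column_stack([x, np.ones(N)])

# Steps 2-4: stack one linear equation per training pair and solve the
# resulting least-squares problem  X_tilde @ w_tilde ~= t.
w_tilde, *_ = np.linalg.lstsq(X_tilde, t, rcond=None)
w, w0 = w_tilde

# Prediction for a new input, y(x) = w*x + w_0.
x_new = 0.5
y_new = w * x_new + w0
print(f"w = {w:.3f}, w_0 = {w0:.3f}, y(0.5) = {y_new:.3f}")
```

np.linalg.lstsq solves the stacked system via an SVD-based pseudo-inverse, which is the standard closed-form way of obtaining the least-squares solution.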

Regression with Polynomials
- How can we fit arbitrary polynomials using least-squares regression? We introduce a feature transformation (as before in ML):
  y(x) = w^T φ(x) = Σ_{i=0}^{M} w_i φ_i(x),
  where the φ_i(x) are basis functions and we assume φ_0(x) = 1.
- E.g. φ(x) = (1, x, x², x³)^T corresponds to fitting a cubic polynomial.

Varying the Order of the Polynomial
- Results for different values of M: with M = 9 we get massive overfitting! Which one should we pick?

Analysis of the Results
- The best representation of the original function sin(2πx) is obtained with M = 3.
- M = 9 gives a perfect fit to the training data, but a poor representation of the original function.
- Why is that? After all, M = 9 contains M = 3 as a special case!

Overfitting
- Problem: the training data contains some noise. The higher-order polynomial fitted perfectly to the noise; we say it was overfitting to the training data.
- The goal is a good prediction of future data: our target function should fit well to the training data, but also generalize. Measure generalization performance on an independent test set.

Measuring Generalization
- E.g. the root-mean-square (RMS) error, E_RMS = sqrt(2 E(w*) / N). (A small numerical sketch follows below.)
- Motivation: division by N lets us compare different data set sizes; the square root ensures E_RMS is measured on the same scale (and in the same units) as the target variable t.

Analyzing Overfitting
- Example: polynomial of degree 9. With relatively little data, overfitting is typical; with enough data, we get a good estimate.
- Overfitting becomes less of a problem with more data. (Slide adapted from Bernt Schiele)
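The following sketch (again Python for illustration; helper names such as fit_poly and make_data, the data sizes, and the noise level are assumptions of this sketch) implements the polynomial feature transformation φ(x) = (1, x, ..., x^M)^T and compares train and test RMS error for several M, reproducing the qualitative overfitting behaviour discussed above.

```python
import numpy as np

def phi(x, M):
    """Polynomial basis functions phi_i(x) = x^i, i = 0..M (phi_0 = 1)."""
    return np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, ..., x^M

def fit_poly(x, t, M):
    """Plain least-squares fit of y(x) = w^T phi(x)."""
    w, *_ = np.linalg.lstsq(phi(x, M), t, rcond=None)
    return w

def rms_error(w, x, t, M):
    """E_RMS = sqrt(2 E(w) / N), with E(w) the sum-of-squares error."""
    y = phi(x, M) @ w
    return np.sqrt(np.mean((y - t) ** 2))

rng = np.random.default_rng(1)
def make_data(n):
    x = np.sort(rng.uniform(0, 1, n))
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)     # independent test set for generalization

for M in (0, 1, 3, 9):
    w = fit_poly(x_train, t_train, M)
    print(f"M={M}: train E_RMS={rms_error(w, x_train, t_train, M):.3f}, "
          f"test E_RMS={rms_error(w, x_test, t_test, M):.3f}")
```

Typically M = 3 gives the best test error, while M = 9 drives the training error towards zero but lets the test error blow up: exactly the overfitting pattern described in the slides.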

What Is Happening Here?
- When fitting the data from before with various polynomials, the coefficients get very large. (Table of coefficient values.)

Regularization
- What can we do then? How can we apply the approach to data sets of limited size? We still want to use relatively complex and flexible models.
- Workaround: regularization, i.e. penalize large coefficient values:
  E~(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) - t_n)² + (λ/2) ||w||².
- Here we've simply added a quadratic regularizer, which is simple to optimize. The resulting form of the problem is called Ridge Regression. (Note: w_0 is often omitted from the regularizer.)

Results with Regularization (M = 9)
- (Figure slide.)

RMS Error for the Regularized Case
- Effect of regularization: the trade-off parameter λ now controls the effective model complexity and thus the degree of overfitting. (A sketch of the closed-form ridge solution follows below.)

Summary
- We've seen several important concepts: linear regression, overfitting, the role of the amount of data, the role of model complexity, regularization.
- How can we approach this more systematically? We would like to work with complex models. How can we prevent overfitting systematically? How can we avoid the need for validation on separate test data? What does it mean to do linear regression? What does it mean to do regularization?

Topics of This Lecture
- Regression: Motivation - polynomial fitting, general least-squares regression, the overfitting problem, regularization, ridge regression
- Recap: Important Concepts from the ML Lecture - probability theory, Bayes decision theory, maximum likelihood estimation, Bayesian estimation
- A Probabilistic View on Regression - least-squares estimation as maximum likelihood
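A minimal sketch of ridge regression in closed form, under the same toy-data assumptions as the previous sketch (Python for illustration; for simplicity the bias w_0 is included in the regularizer here, although the slide notes it is often omitted).

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Closed-form ridge regression: minimize
       (1/2) sum_n (w^T phi(x_n) - t_n)^2 + (lam/2) ||w||^2,
       which gives w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    D = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(D) + Phi.T @ Phi, Phi.T @ t)

# Degree-9 polynomial features for a small noisy sin(2*pi*x) dataset.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 10))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)
Phi = np.vander(x, 10, increasing=True)     # M = 9

for lam in (1e-10, 1e-4, 1e-1, 1.0):
    w = fit_ridge(Phi, t, lam)
    print(f"lambda={lam:g}: max |w_i| = {np.max(np.abs(w)):.2e}")
```

Increasing λ visibly shrinks the coefficient magnitudes, which is exactly the "penalize large coefficient values" effect described above.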

Recap: The Rules of Probability
- Basic rules: the sum rule, p(X) = Σ_Y p(X, Y), and the product rule, p(X, Y) = p(Y|X) p(X).
- From those, we can derive Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X), where p(X) = Σ_Y p(X|Y) p(Y). In words: posterior = likelihood × prior / normalization factor.

Recap: Bayes Decision Theory
- (Figure slide: the class likelihoods p(x|a) and p(x|b), the products likelihood × prior p(x|a)p(a) and p(x|b)p(b), and the resulting decision boundary.)

Recap: Gaussian (or Normal) Distribution
- One-dimensional case, with mean μ and variance σ²:
  N(x | μ, σ²) = 1/(sqrt(2π) σ) exp{ -(x - μ)² / (2σ²) }.
- Multi-dimensional case, with mean μ and covariance Σ:
  N(x | μ, Σ) = 1/((2π)^{D/2} |Σ|^{1/2}) exp{ -(1/2) (x - μ)^T Σ^{-1} (x - μ) }.

Side Note
- Notation: in many situations, it will be necessary to work with the inverse of the covariance matrix Σ. We call Λ = Σ^{-1} the precision matrix, and we can therefore also write the Gaussian as N(x | μ, Λ^{-1}).

Recap: Parametric Methods
- Given: data X = {x_1, x_2, ..., x_N} and a parametric form of the distribution with parameters θ (e.g. for a Gaussian distribution: θ = (μ, σ)).
- Learning means estimation of the parameters θ.
- Likelihood of θ: the probability that the data X have indeed been generated from a probability density with parameters θ, L(θ) = p(X|θ). (Slide adapted from Bernt Schiele)

Recap: Maximum Likelihood Approach
- Computation of the likelihood: for a single data point, p(x_n|θ). Assumption: all data points X = {x_1, ..., x_N} are independent, so
  L(θ) = p(X|θ) = Π_{n=1}^{N} p(x_n|θ).
- Negative log-likelihood: E(θ) = -ln L(θ) = -Σ_{n=1}^{N} ln p(x_n|θ).
- Estimation of the parameters θ (learning): maximize the likelihood (= minimize the negative log-likelihood), i.e. take the derivative and set it to zero:
  ∂E(θ)/∂θ = -Σ_{n=1}^{N} [∂p(x_n|θ)/∂θ] / p(x_n|θ) = 0.
  (A small numerical sketch follows below.)
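A small numerical sketch of the maximum likelihood recipe for the 1-D Gaussian (Python for illustration; the data-generating parameters and sample size are arbitrary assumptions of this sketch). Setting the derivative of the negative log-likelihood to zero gives the familiar closed-form estimates, which are computed directly here.

```python
import numpy as np

def gauss_pdf(x, mu, sigma2):
    """1-D Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Data assumed to be drawn i.i.d. from a Gaussian with unknown parameters.
rng = np.random.default_rng(2)
X = rng.normal(loc=1.5, scale=0.8, size=200)

# Maximum likelihood: setting the derivative of the negative log-likelihood
# E(theta) = -sum_n ln p(x_n | theta) to zero gives the closed-form estimates:
mu_ml = np.mean(X)                      # mu_ML = (1/N) sum_n x_n
sigma2_ml = np.mean((X - mu_ml) ** 2)   # sigma^2_ML = (1/N) sum_n (x_n - mu_ML)^2

# Negative log-likelihood at the ML estimate (the quantity being minimized).
E = -np.sum(np.log(gauss_pdf(X, mu_ml, sigma2_ml)))
print(f"mu_ML = {mu_ml:.3f}, sigma2_ML = {sigma2_ml:.3f}, E(theta_ML) = {E:.2f}")
```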

Recap: Maximum Likelihood - Limitations
- Maximum likelihood has several significant limitations: it systematically underestimates the variance of the distribution!
- E.g. consider the case N = 1, X = {x_1}. The maximum-likelihood estimate is then μ̂ = x_1 and σ̂ = 0!
- We say ML overfits to the observed data. We will still often use ML, but it is important to know about this effect. (A small simulation of this effect follows below.)

Recap: Deeper Reason
- Maximum likelihood is a Frequentist concept: in the Frequentist view, probabilities are the frequencies of random, repeatable events. These frequencies are fixed, but can be estimated more precisely when more data is available.
- This is in contrast to the Bayesian interpretation: in the Bayesian view, probabilities quantify the uncertainty about certain states or events. This uncertainty can be revised in the light of new evidence.
- Bayesians and Frequentists do not like each other too well... (Slide adapted from Bernt Schiele)

Recap: Bayesian Learning Approach
- Bayesian view: consider the parameter vector θ as a random variable. When estimating the parameters, what we compute is
  p(x|X) = ∫ p(x, θ|X) dθ,   with   p(x, θ|X) = p(x|θ, X) p(θ|X).
- Assumption: given θ, the first factor doesn't depend on X anymore; it is entirely determined by the parameter θ (i.e. by the parametric form of the pdf). Hence
  p(x|X) = ∫ p(x|θ) p(θ|X) dθ.   (Slide adapted from Bernt Schiele)

Recap: Bayesian Learning Approach (Discussion)
- Written out,
  p(x|X) = ∫ p(x|θ) L(θ) p(θ) / ( ∫ L(θ) p(θ) dθ ) dθ,
  where p(x|θ) is the estimate for x based on the parametric form θ, L(θ) is the likelihood of the parametric form θ given the data set X, p(θ) is the prior for the parameters θ, and the denominator is the normalization: integrate over all possible values of θ.
- The more uncertain we are about θ, the more we average over all possible parameter values.

Topics of This Lecture
- Regression: Motivation - polynomial fitting, general least-squares regression, the overfitting problem, regularization, ridge regression
- Recap: Important Concepts from the ML Lecture - probability theory, Bayes decision theory, maximum likelihood estimation, Bayesian estimation
- Next lecture: A Probabilistic View on Regression - least-squares estimation as maximum likelihood
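A short simulation of the variance-underestimation effect mentioned above (Python for illustration; the sample sizes and number of repetitions are arbitrary assumptions of this sketch): on average the ML variance estimate comes out as (N-1)/N times the true variance, so it is exactly 0 for N = 1, and the bias vanishes as N grows.

```python
import numpy as np

# Repeatedly draw small samples from a Gaussian with known variance and
# compute the ML variance estimate sigma^2_ML = (1/N) sum_n (x_n - mean)^2.
rng = np.random.default_rng(3)
true_var = 1.0
for N in (1, 2, 5, 50):
    samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, N))
    sigma2_ml = np.mean((samples - samples.mean(axis=1, keepdims=True)) ** 2, axis=1)
    # On average the ML estimate is (N-1)/N times the true variance:
    # systematic underestimation, most extreme for N = 1.
    print(f"N={N:2d}: mean sigma^2_ML = {sigma2_ml.mean():.3f} "
          f"(expected {(N - 1) / N * true_var:.3f})")
```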

References and Further Reading
- More information, including a short review of probability theory and a good introduction to Bayes decision theory, can be found in Chapters 1.1, 1.2, and 1.5 of:
  Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.