Advanced Machine Learning
Lecture 1: Introduction & Organization
Prof. Bastian Leibe (leibe@vision.rwth-aachen.de)
Visual Computing Institute, RWTH Aachen University
http://www.vision.rwth-aachen.de/
20.10.2015

Teaching Assistants: Umer Rafi (rafi@vision.rwth-aachen.de), Lucas Beyer (beyer@vision.rwth-aachen.de)

Course webpage: http://www.vision.rwth-aachen.de/teaching/
Slides will be made available on the webpage. There is also an L2P electronic repository. Please subscribe to the lecture on the Campus system! This is important to get email announcements and L2P access.

Language
The official course language will be English if at least one English-speaking student is present; if not, you can choose. However, please tell me when I'm talking too fast or when I should repeat something in German for better understanding. You may at any time ask questions in German, you may turn in your exercises in German, and you may take the oral exam in German.

Relationship to Previous Courses
The lecture Machine Learning (past summer semester) gave an introduction to ML, classification, and graphical models. This course, Advanced Machine Learning, is its natural continuation and takes a deeper look at the underlying concepts. But we will try to make it accessible to newcomers as well. Quick poll: who hasn't heard the ML lecture?

New Content This Year
Lots of new material this year, including a large lecture block on Deep Learning. This is the first time for us to teach it (so, bear with us...).

Organization
Structure: 3V (lecture) + 1Ü (exercises), 6 ECTS credits. Part of the area Applied Computer Science.
Place & time: Lecture/Exercises Mon 14:15-15:45, room UMIC 025; Lecture/Exercises Thu 10:15-11:45, room UMIC 025.
Exam: oral or written exam, depending on the number of participants. Towards the end of the semester, a date will be proposed.
Exercises and Supplementary Material
Exercises: typically 1 exercise sheet every 2 weeks, with pen & paper and programming exercises (Matlab for the early topics, Theano for the Deep Learning topics; Monday: Matlab tutorial). The exercises give hands-on experience with the algorithms from the lecture. Send your solutions the night before the exercise class.
Supplementary material: research papers and book chapters, provided on the course webpage: http://www.vision.rwth-aachen.de/teaching/

Textbooks
Most lecture topics are covered in Bishop's book; some additional topics can be found in Rasmussen & Williams.
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (available in the library's "Handapparat").
Carl E. Rasmussen, Christopher K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006 (also available online: http://www.gaussianprocess.org/gpml/).
Research papers will be given out for some topics: tutorials and deeper introductions, as well as application papers.

How to Find Us
Office: UMIC Research Centre, Mies-van-der-Rohe-Strasse 15, room 124.
Office hours: if you have questions about the lecture, come see us. My regular office hours will be announced. Send us an email beforehand to confirm a time slot. Questions are welcome!

(Statistical) Machine Learning
Principles, methods, and algorithms for learning and prediction on the basis of past evidence. Machine learning is already everywhere:
Speech recognition (e.g. speed-dialing)
Computer vision (e.g. face detection)
Hand-written character recognition (e.g. letter delivery)
Information retrieval (e.g. image & video indexing)
Operating systems (e.g. caching)
Fraud detection (e.g. credit cards)
Text filtering (e.g. email spam filters)
Game playing (e.g. strategy prediction)
Robotics (e.g. prediction of battery lifetime)
Application examples (illustration slides):
Computer Vision (object recognition, segmentation, scene understanding)
Information Retrieval (retrieval, categorization, clustering, ...)
Financial Prediction (time series analysis, ...)
Medical Diagnosis (inference from partial observations) [image from Kevin Murphy]
Bioinformatics (modelling gene microarray data, ...) [image from Kevin Murphy]
Robotics (DARPA Grand Challenge, ...)
Machine Learning: Core Questions
Learning to perform a task from experience. The task can often be expressed through a mathematical function

    y = f(x; w)

x: input
y: output
w: parameters (this is what is "learned")
w characterizes the family of functions and indexes the space of hypotheses; it can be a vector, a connection matrix, a graph, ...

Classification vs. Regression
Regression: continuous y.
Classification: discrete y, e.g. class membership, sometimes also a posterior probability.

A Look Back: Lecture Machine Learning
Fundamentals: Bayes decision theory; probability density estimation.
Classification approaches: linear discriminant functions; support vector machines; ensemble methods & boosting; randomized trees, forests & ferns.
Generative models: Bayesian networks; Markov random fields.

This Lecture: Advanced Machine Learning
Extending the lecture Machine Learning from last semester:
Regression approaches: linear regression; regularization (ridge, lasso); Gaussian processes.
Learning with latent variables: EM and generalizations; approximate inference.
Deep Learning: neural networks; CNNs, RNNs, RBMs, etc.

Let's Get Started
Some of you already have basic ML background (who hasn't?). We'll start with a gentle introduction and try to make the lecture accessible to newcomers as well. We'll review the main concepts before applying them, and I'll point out chapters to review from the ML lecture whenever knowledge from there is needed or helpful. But please tell me when I'm moving too fast (or too slow).

Topics of This Lecture
Regression: motivation; polynomial fitting; general least-squares regression; the overfitting problem; regularization; ridge regression.
Recap of important concepts from the ML lecture: probability theory; Bayes decision theory; maximum likelihood estimation; Bayesian estimation.
A probabilistic view on regression: least-squares estimation as maximum likelihood.
Regression
Learning to predict a continuous function value. Given a training set X = {x_1, ..., x_N} with target values T = {t_1, ..., t_N}, learn a continuous function y(x) to predict the function value for a new input x.

Steps towards a solution:
1. Choose a form of the function y(x, w) with parameters w.
2. Define an error function E(w) to optimize.
3. Optimize E(w) for w to find a good solution. (This may involve math.)
4. Derive the properties of this solution and think about its limitations.

Example: Polynomial Curve Fitting
Toy dataset: generated by the function sin(2πx), with a small level of random Gaussian noise added (blue dots). Goal: fit a polynomial function

    y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M

to this data. Note: y is a nonlinear function of x, but a linear function of the w_j.

Error Function
How do we determine the values of the coefficients w? We need to define an error function to be minimized; this function specifies how a deviation from the target value should be weighted. A popular choice is the sum-of-squares error

    E(w) = (1/2) Σ_{n=1..N} { y(x_n, w) - t_n }^2

We'll discuss the motivation for this particular function later.

Minimizing the Error
How do we minimize the error? Solution (always!): compute the derivative and set it to zero. Since the error is a quadratic function of w, its derivative is linear in w, so the minimization has a unique solution.

Least-Squares Regression
Given training data points X = {x_1 ∈ R^d, ..., x_n ∈ R^d} and associated function values T = {t_1 ∈ R, ..., t_n ∈ R}, start with a linear regressor: try to enforce

    x_i^T w + w_0 = t_i    for all i,

i.e. one linear equation for each training data point / label pair. This is the same basic setup used for least-squares classification; only the target values are now continuous.

Setup:
Step 1: define the augmented vectors x̃_i = (x_i^T, 1)^T and w̃ = (w^T, w_0)^T.
Step 2: rewrite each constraint as x̃_i^T w̃ = t_i.
Step 3: stack the x̃_i^T as the rows of a matrix X̃, giving the matrix-vector notation X̃ w̃ = t.
Step 4: find the least-squares solution w̃ = (X̃^T X̃)^{-1} X̃^T t.
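The four setup steps above can be sketched in a few lines. The course exercises use Matlab; the NumPy version below is an illustrative translation only, not course material:

```python
import numpy as np

def fit_least_squares(X, t):
    """Fit a linear regressor x^T w + w0 = t in the least-squares sense.

    X: (n, d) array of inputs; t: (n,) array of targets.
    Returns the augmented weight vector w_tilde = (w, w0).
    """
    # Step 1: augment each input with a constant 1 feature (absorbs the bias w0).
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])
    # Steps 2-4: solve the stacked system X_tilde @ w_tilde = t in least squares.
    w_tilde, *_ = np.linalg.lstsq(X_tilde, t, rcond=None)
    return w_tilde

# Example: recover the known linear function t = 2x - 1 from four points.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = 2.0 * X[:, 0] - 1.0
w_tilde = fit_least_squares(X, t)
```

`np.linalg.lstsq` computes the same solution as the normal equations (X̃^T X̃)^{-1} X̃^T t, but in a numerically more stable way.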
Regression with Polynomials
How can we fit arbitrary polynomials using least-squares regression? We introduce a feature transformation (as before in the ML lecture):

    y(x) = w^T φ(x) = Σ_{i=0..M} w_i φ_i(x)

with basis functions φ_i, where we assume φ_0(x) = 1. E.g., φ(x) = (1, x, x^2, x^3)^T fits a cubic polynomial.

Varying the Order of the Polynomial
Results for different values of M: M = 3 gives the best representation of the original function sin(2πx). With M = 9 we get a perfect fit to the training data, but a poor representation of the original function: massive overfitting! Which one should we pick? And why does M = 9 do worse? After all, M = 9 contains M = 3 as a special case!

Overfitting
Problem: the training data contains some noise, and the higher-order polynomial fitted the noise perfectly. We say it was overfitting to the training data. But the goal is a good prediction of future data: our target function should fit well to the training data, but also generalize. We therefore measure generalization performance on an independent test set.

Measuring Generalization
E.g., the root-mean-square (RMS) error:

    E_RMS = sqrt( 2 E(w*) / N )

Motivation: division by N lets us compare different data set sizes, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t.

Analyzing Overfitting
Example: a polynomial of degree 9. With relatively little data, overfitting is typical; with enough data, we obtain a good estimate. Overfitting becomes less of a problem with more data. (Slide adapted from Bernt Schiele)
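The feature transformation and the RMS error can be sketched as follows (again an illustrative NumPy version, not the course's Matlab material):

```python
import numpy as np

def poly_features(x, M):
    """Feature transform phi(x) = (1, x, x^2, ..., x^M)^T, one row per input."""
    return np.vander(x, M + 1, increasing=True)

def fit_poly(x, t, M):
    """Least-squares fit of an order-M polynomial y(x) = w^T phi(x)."""
    Phi = poly_features(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w)/N); with E(w) = 1/2 sum (y - t)^2 this is
    exactly the root of the mean squared error."""
    y = poly_features(x, len(w) - 1) @ w
    return np.sqrt(np.mean((y - t) ** 2))

# Sanity check: noise-free quadratic data is fit exactly by M = 2.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
t = 1.0 + 2.0 * x + 3.0 * x ** 2
w = fit_poly(x, t, 2)
```

Evaluating `rms_error` on a held-out test set, rather than on the training points, is what reveals the overfitting of high-order polynomials described above.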
What Is Happening Here?
Looking at the fitted coefficients for the various polynomials, we see that for the high-order fits the coefficients get very large.

Regularization
What can we do then? How can we apply the approach to data sets of limited size? We still want to use relatively complex and flexible models. A workaround is regularization: penalize large coefficient values by minimizing

    Ẽ(w) = (1/2) Σ_{n=1..N} { y(x_n, w) - t_n }^2 + (λ/2) ||w||^2

Here we've simply added a quadratic regularizer, which is simple to optimize. The resulting form of the problem is called ridge regression. (Note: w_0 is often omitted from the regularizer.)

Results with Regularization (M = 9)
Effect of regularization: the trade-off parameter λ now controls the effective model complexity and thus the degree of overfitting.

Summary
We've seen several important concepts: linear regression, overfitting, the role of the amount of data, the role of model complexity, and regularization. How can we approach this more systematically? We would like to work with complex models. How can we prevent overfitting systematically? How can we avoid the need for validation on separate test data? What does it mean to do linear regression? What does it mean to do regularization?
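The ridge-regression objective above still has a closed-form solution; a minimal NumPy sketch (illustrative only, not course material):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Ridge regression: minimize 1/2 ||Phi w - t||^2 + lam/2 ||w||^2.

    Setting the gradient to zero gives the closed form
    w = (Phi^T Phi + lam I)^{-1} Phi^T t.
    (For simplicity, w0 is regularized here too.)
    """
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ t)

# Example: design matrix for a linear fit, targets t = 1 + 2x.
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
w_ls = fit_ridge(Phi, t, 0.0)      # lam = 0 recovers plain least squares
w_reg = fit_ridge(Phi, t, 10.0)    # lam > 0 shrinks the coefficients
```

Increasing λ shrinks ||w||, which is exactly the mechanism that keeps the huge coefficients of the unregularized M = 9 fit in check.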
Recap: The Rules of Probability
Basic rules:
Sum rule: p(X) = Σ_Y p(X, Y)
Product rule: p(X, Y) = p(Y|X) p(X)
From those, we can derive Bayes' theorem:

    p(Y|X) = p(X|Y) p(Y) / p(X),    where p(X) = Σ_Y p(X|Y) p(Y).

Recap: Bayes Decision Theory
Bayes' theorem combines the class-conditional likelihoods p(x|a), p(x|b) with the priors p(a), p(b):

    Posterior = Likelihood × Prior / Normalization Factor

The decision boundary lies where the posteriors (equivalently, likelihood × prior) of the competing classes are equal.

Recap: Gaussian (or Normal) Distribution
One-dimensional case, with mean μ and variance σ^2:

    N(x | μ, σ^2) = 1 / (sqrt(2π) σ) · exp{ -(x - μ)^2 / (2σ^2) }

Multi-dimensional case, with mean μ and covariance Σ:

    N(x | μ, Σ) = 1 / ((2π)^{D/2} |Σ|^{1/2}) · exp{ -(1/2) (x - μ)^T Σ^{-1} (x - μ) }

Side note: in many situations, it will be necessary to work with the inverse of the covariance matrix, Λ = Σ^{-1}, called the precision matrix. We can therefore also write the Gaussian in terms of Λ.

Recap: Parametric Methods
Given: data X = {x_1, x_2, ..., x_N} and a parametric form of the distribution with parameters θ (e.g. for a Gaussian: θ = (μ, σ)). Learning is the estimation of the parameters θ.

Recap: Maximum Likelihood Approach
The likelihood of θ is the probability that the data X have indeed been generated from a probability density with parameters θ:

    L(θ) = p(X|θ)

Computation of the likelihood: a single data point contributes p(x_n|θ). Under the assumption that all data points X = {x_1, ..., x_N} are independent,

    L(θ) = p(X|θ) = Π_{n=1..N} p(x_n|θ)

Negative log-likelihood:

    E(θ) = -ln L(θ) = -Σ_{n=1..N} ln p(x_n|θ)

Estimation of the parameters θ (learning): maximize the likelihood, i.e. minimize the negative log-likelihood. Take the derivative and set it to zero:

    ∂E(θ)/∂θ = -Σ_{n=1..N} [ ∂p(x_n|θ)/∂θ ] / p(x_n|θ) = 0

(Slides adapted from Bernt Schiele)
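For a 1-D Gaussian, setting the derivative of the negative log-likelihood to zero yields the familiar closed-form estimates: the sample mean and the mean squared deviation. A small stdlib-only Python sketch (illustrative, not course material):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) for a scalar x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def mle_gaussian(xs):
    """Closed-form ML estimates for a 1-D Gaussian:
    mu_hat = sample mean, sigma2_hat = mean squared deviation.
    Note the 1/N factor in sigma2_hat (not 1/(N-1))."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

mu, sigma2 = mle_gaussian([1.0, 2.0, 3.0])  # mu = 2, sigma2 = 2/3
```

The 1/N factor in the variance estimate is exactly what the next slides criticize: it makes the ML estimate biased.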
Recap: Maximum Likelihood — Limitations
Maximum likelihood has several significant limitations: it systematically underestimates the variance of the distribution. E.g., consider the case N = 1, X = {x_1}. The maximum-likelihood estimate is then μ̂ = x_1 and σ̂ = 0! We say ML overfits to the observed data. We will still often use ML, but it is important to know about this effect. (Slide adapted from Bernt Schiele)

Recap: The Deeper Reason
Maximum likelihood is a frequentist concept. In the frequentist view, probabilities are the frequencies of random, repeatable events; these frequencies are fixed, but can be estimated more precisely when more data is available. This is in contrast to the Bayesian interpretation, in which probabilities quantify the uncertainty about certain states or events; this uncertainty can be revised in the light of new evidence. (Bayesians and frequentists do not like each other too well...)

Recap: Bayesian Learning Approach
Bayesian view: consider the parameter vector θ as a random variable. When estimating the parameters, what we compute is

    p(x|X) = ∫ p(x, θ|X) dθ = ∫ p(x|θ, X) p(θ|X) dθ = ∫ p(x|θ) p(θ|X) dθ

using the product rule p(x, θ|X) = p(x|θ, X) p(θ|X) and the assumption that, given θ, x does not depend on X anymore: p(x|θ) is entirely determined by θ, i.e. by the parametric form of the pdf. Expanding p(θ|X) in terms of the likelihood and the prior gives

    p(x|X) = ∫ p(x|θ) L(θ) p(θ) dθ / ∫ L(θ) p(θ) dθ

where p(x|θ) is the estimate for x based on the parametric form θ, L(θ) is the likelihood of θ given the data set X, p(θ) is the prior for the parameters, and the denominator is the normalization, obtained by integrating over all possible values of θ.
Discussion: the more uncertain we are about θ, the more we average over all possible parameter values.
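The underestimation of the variance can be made concrete: for any sample, the ML estimate (1/N normalization) equals (N-1)/N times the unbiased estimate (1/(N-1) normalization), and for N = 1 it collapses to zero. A minimal stdlib-only sketch:

```python
def ml_variance(xs):
    """ML variance estimate: mean squared deviation (1/N normalization)."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def unbiased_variance(xs):
    """Unbiased variance estimate (1/(N-1) normalization), requires N >= 2."""
    n = len(xs)
    return ml_variance(xs) * n / (n - 1)

# The extreme case from the slide: a single observation gives variance 0.
assert ml_variance([5.0]) == 0.0
```

The deterministic factor (N-1)/N between the two estimators is the bias the slides refer to; it vanishes as N grows, which is why overfitting of ML becomes less of a problem with more data.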
Next lecture: A Probabilistic View on Regression — Least-Squares Estimation as Maximum Likelihood.
References and Further Reading
More information, including a short review of probability theory and a good introduction to Bayes decision theory, can be found in Chapters 1.1, 1.2, and 1.5 of:
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.