Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

1 Advanced Machine Learning Lecture 3 Linear Regression II Bastian Leibe RWTH Aachen leibe@vision.rwth-aachen.de

2 This Lecture: Advanced Machine Learning Regression Approaches Linear Regression Regularization (Ridge, Lasso) Support Vector Regression Gaussian Processes Learning with Latent Variables EM and Generalizations Dirichlet Processes Structured Output Learning Large-margin Learning

3 Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 3

4 Recap: Probabilistic Regression. First assumption: our target function values t are generated by adding noise to the ideal function estimate: $t = y(\mathbf{x}, \mathbf{w}) + \epsilon$, where $y(\mathbf{x}, \mathbf{w})$ is the regression function, $\mathbf{x}$ the input value, and $\mathbf{w}$ the weights or parameters. Second assumption: the noise is Gaussian distributed, $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$, with mean $y(\mathbf{x}, \mathbf{w})$ and variance $\beta^{-1}$ ($\beta$ = precision). Slide adapted from Bernt Schiele

5 Recap: Probabilistic Regression. Given training data points $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ and associated function values $\mathbf{t} = [t_1, \ldots, t_n]^T$. Conditional likelihood (assuming i.i.d. data): $p(\mathbf{t} \mid X, \mathbf{w}, \beta) = \prod_{i=1}^{n} \mathcal{N}\!\left(t_i \mid y(\mathbf{x}_i, \mathbf{w}), \beta^{-1}\right)$, with the generalized linear regression function $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$. Maximize w.r.t. $\mathbf{w}$ and $\beta$. Slide adapted from Bernt Schiele

6 Recap: Maximum Likelihood Regression. Setting the gradient of the log-likelihood w.r.t. $\mathbf{w}$ to zero yields $\mathbf{w}_{ML} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$, with the design matrix $\Phi = [\boldsymbol{\phi}(\mathbf{x}_1), \ldots, \boldsymbol{\phi}(\mathbf{x}_n)]^T$. This is the same solution as in least-squares regression: least-squares regression is equivalent to Maximum Likelihood under the assumption of Gaussian noise. Slide adapted from Bernt Schiele
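To make the closed-form solution tangible, here is a minimal NumPy sketch (not part of the original slides; the toy sine data, the polynomial order, and all variable names are illustrative assumptions). It builds a design matrix, solves the least-squares problem, and also computes the residual variance that the next slide identifies with the inverse noise precision 1/β.

```python
import numpy as np

# Hypothetical toy data: noisy samples of a sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

def design_matrix(x, M=3):
    """Polynomial basis: phi_j(x) = x**j for j = 0..M (phi_0 = 1 acts as a bias)."""
    return np.vander(x, N=M + 1, increasing=True)

Phi = design_matrix(x)   # one row phi(x_i)^T per data point

# Maximum-likelihood / least-squares solution: w_ML = (Phi^T Phi)^{-1} Phi^T t.
# lstsq is numerically safer than forming the inverse explicitly.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# ML estimate of the noise: 1/beta is the residual variance around the fit.
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)

print("w_ML =", w_ml)
print("1/beta_ML (residual variance) =", 1.0 / beta_ml)
```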

7 Recap: Role of the Precision Parameter. Also use ML to determine the precision parameter $\beta$: setting the gradient of the log-likelihood w.r.t. $\beta$ to zero gives $\frac{1}{\beta_{ML}} = \frac{1}{n} \sum_{i=1}^{n} \left( t_i - \mathbf{w}_{ML}^T \boldsymbol{\phi}(\mathbf{x}_i) \right)^2$. The inverse of the noise precision is given by the residual variance of the target values around the regression function.

8 Recap: Predictive Distribution. Having determined the parameters $\mathbf{w}_{ML}$ and $\beta_{ML}$, we can now make predictions for new values of $\mathbf{x}$. This means evaluating $p(t \mid \mathbf{x}, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\!\left(t \mid y(\mathbf{x}, \mathbf{w}_{ML}), \beta_{ML}^{-1}\right)$. Rather than giving a point estimate, we can now also give an estimate of the estimation uncertainty. Image source: C.M. Bishop, 2006

9 Recap: Maximum-A-Posteriori Estimation. Introduce a prior distribution over the coefficients $\mathbf{w}$. For simplicity, assume a zero-mean Gaussian distribution $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$. The new hyperparameter $\alpha$ controls the distribution of the model parameters. Express the posterior distribution over $\mathbf{w}$ using Bayes' theorem: $p(\mathbf{w} \mid X, \mathbf{t}) \propto p(\mathbf{t} \mid X, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)$. We can now determine $\mathbf{w}$ by maximizing the posterior. This technique is called maximum-a-posteriori (MAP) estimation.

10 Recap: MAP Solution. Minimize the negative logarithm of the posterior: $-\ln p(\mathbf{w} \mid X, \mathbf{t}) = \frac{\beta}{2} \sum_{i=1}^{n} \left( t_i - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i) \right)^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}$. The MAP solution is therefore the minimizer of $\frac{1}{2} \sum_{i=1}^{n} \left( t_i - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i) \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$. Maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error (with $\lambda = \alpha / \beta$).

11 MAP Solution (2). Setting the gradient to zero yields $\mathbf{w}_{MAP} = (\Phi^T \Phi + \lambda \mathbf{I})^{-1} \Phi^T \mathbf{t}$, with $\Phi = [\boldsymbol{\phi}(\mathbf{x}_1), \ldots, \boldsymbol{\phi}(\mathbf{x}_n)]^T$ as before. Effect of regularization: keeps the inverse well-conditioned.
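A small numerical sketch (illustrative data and names, not from the slides) of the MAP/ridge solution, showing how the added λI term keeps the matrix being inverted well-conditioned:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# High-order polynomial basis on few points -> Phi^T Phi is badly conditioned.
Phi = np.vander(x, N=10, increasing=True)
A = Phi.T @ Phi

lam = 1e-3  # regularization coefficient lambda = alpha / beta (illustrative value)
A_reg = A + lam * np.eye(A.shape[0])

print("cond(Phi^T Phi)           =", np.linalg.cond(A))
print("cond(Phi^T Phi + lam * I) =", np.linalg.cond(A_reg))

# MAP / ridge solution: w_MAP = (Phi^T Phi + lam I)^{-1} Phi^T t
w_map = np.linalg.solve(A_reg, Phi.T @ t)
print("w_MAP =", w_map)
```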

12 Recap: Bayesian Curve Fitting. Given training data points $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ and associated function values $\mathbf{t} = [t_1, \ldots, t_n]^T$, our goal is to predict the value of $t$ for a new point $\mathbf{x}$. Evaluate the predictive distribution $p(t \mid \mathbf{x}, X, \mathbf{t}) = \int p(t \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w} \mid X, \mathbf{t})\, d\mathbf{w}$, where $p(\mathbf{w} \mid X, \mathbf{t})$ is the posterior we just computed for MAP and $p(t \mid \mathbf{x}, \mathbf{w})$ is the noise distribution (again assumed Gaussian here). Assume that the parameters $\alpha$ and $\beta$ are fixed and known for now.

13 Bayesian Curve Fitting. Under those assumptions, the predictive distribution is a Gaussian and can be evaluated analytically: $p(t \mid \mathbf{x}, X, \mathbf{t}) = \mathcal{N}\!\left(t \mid m(\mathbf{x}), s^2(\mathbf{x})\right)$, where the mean and variance are given by $m(\mathbf{x}) = \beta\, \boldsymbol{\phi}(\mathbf{x})^T S \sum_{i=1}^{n} \boldsymbol{\phi}(\mathbf{x}_i)\, t_i$ and $s^2(\mathbf{x}) = \beta^{-1} + \boldsymbol{\phi}(\mathbf{x})^T S\, \boldsymbol{\phi}(\mathbf{x})$, and $S$ is the regularized covariance matrix, $S^{-1} = \alpha \mathbf{I} + \beta \sum_{i=1}^{n} \boldsymbol{\phi}(\mathbf{x}_i) \boldsymbol{\phi}(\mathbf{x}_i)^T$. Image source: C.M. Bishop, 2006
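The predictive mean m(x) and variance s²(x) can be evaluated directly from these formulas. The following sketch assumes fixed α and β, Gaussian basis functions, and toy sinusoidal data; all names and parameter values are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 1.0, size=20)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.shape)

alpha, beta = 2.0, 25.0          # assumed fixed hyperparameters (prior / noise precision)
centers = np.linspace(0, 1, 9)   # Gaussian basis function centers (illustrative)
s = 0.15                         # basis function width

def phi(x):
    """Gaussian basis functions plus a constant bias term."""
    x = np.atleast_1d(x)
    feats = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), feats])

Phi = phi(x_train)                                     # N x M design matrix
S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
S = np.linalg.inv(S_inv)                               # regularized covariance matrix

def predictive(x_new):
    """Mean m(x) and variance s^2(x) of the Gaussian predictive distribution."""
    phi_new = phi(x_new)
    mean = beta * phi_new @ S @ Phi.T @ t_train
    var = 1.0 / beta + np.sum(phi_new @ S * phi_new, axis=1)
    return mean, var

m, v = predictive(np.array([0.25, 0.5, 0.75]))
print("predictive mean:", m, "predictive std:", np.sqrt(v))
```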

14 Analyzing the result. Analyzing the variance of the predictive distribution, $s^2(\mathbf{x}) = \beta^{-1} + \boldsymbol{\phi}(\mathbf{x})^T S\, \boldsymbol{\phi}(\mathbf{x})$: the first term is the uncertainty in the predicted value due to noise on the target variables (already expressed in the ML solution); the second term is the uncertainty in the parameters $\mathbf{w}$ (a consequence of the Bayesian treatment).

15 Bayesian Predictive Distribution Important difference to previous example Uncertainty may vary with test point x! 15 Image source: C.M. Bishop, 2006

16 Discussion. We now have a better understanding of regression. Least-squares regression corresponds to an assumption of Gaussian noise; we can now also plug in different noise models and explore how they affect the error function. L2 regularization corresponds to a Gaussian prior on the parameters $\mathbf{w}$; we can now also use different regularizers and explore what they mean (this lecture). The general formulation uses basis functions $\boldsymbol{\phi}(\mathbf{x})$; we can now also use different basis functions.

17 Discussion. General regression formulation: in principle, we can perform regression in arbitrary spaces and with many different types of basis functions. However, there is a caveat. Can you see what it is? Example: polynomial curve fitting with order M = 3 in D input dimensions. The number of coefficients grows as $D^M$! The approach quickly becomes impractical for high dimensions. This is known as the curse of dimensionality. We will encounter some ways to deal with this later...
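As a rough illustration of this growth (an assumed example, not from the slides): a full polynomial of order M in D input variables has on the order of $D^M$ coefficients, and the exact count is the binomial coefficient C(D+M, M).

```python
from math import comb

# Number of coefficients of a full polynomial of order M in D variables:
# C(D + M, M), which for fixed M grows on the order of D**M.
M = 3
for D in (1, 2, 10, 100, 1000):
    print(f"D = {D:5d}: {comb(D + M, M):12d} coefficients")
```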

18 Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 18

19 Loss Functions for Regression. Given $p(t \mid \mathbf{x}, \mathbf{w}, \beta)$, how do we actually estimate a function value $y_t$ for a new point $\mathbf{x}_t$? We need a loss function, just as in the classification case. Optimal prediction: minimize the expected loss $\mathbb{E}[L] = \iint L(t, y(\mathbf{x}))\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$. Slide adapted from Stefan Roth

20 Loss Functions for Regression. Simplest case, the squared loss: $L(t, y(\mathbf{x})) = \{y(\mathbf{x}) - t\}^2$ with expected loss $\mathbb{E}[L] = \iint \{y(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$. Setting the derivative w.r.t. $y(\mathbf{x})$ to zero: $\frac{\partial \mathbb{E}[L]}{\partial y(\mathbf{x})} = 2 \int \{y(\mathbf{x}) - t\}\, p(\mathbf{x}, t)\, dt \overset{!}{=} 0 \;\Rightarrow\; \int t\, p(\mathbf{x}, t)\, dt = y(\mathbf{x}) \int p(\mathbf{x}, t)\, dt$. Slide adapted from Stefan Roth

21 Loss Functions for Regression. $\int t\, p(\mathbf{x}, t)\, dt = y(\mathbf{x}) \int p(\mathbf{x}, t)\, dt \;\Leftrightarrow\; y(\mathbf{x}) = \int t\, \frac{p(\mathbf{x}, t)}{p(\mathbf{x})}\, dt = \int t\, p(t \mid \mathbf{x})\, dt \;\Leftrightarrow\; y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}]$. Important result: under squared loss, the optimal regression function is the mean $\mathbb{E}[t \mid \mathbf{x}]$ of the posterior $p(t \mid \mathbf{x})$, also called the mean prediction. For our generalized linear regression function and squared loss, we obtain as result $y(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$. Slide adapted from Stefan Roth

22 Visualization of Mean Prediction. Slide adapted from Stefan Roth Image source: C.M. Bishop, 2006

23 Loss Functions for Regression. Different derivation: expand the square term as follows: $\{y(\mathbf{x}) - t\}^2 = \{y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] + \mathbb{E}[t \mid \mathbf{x}] - t\}^2 = \{y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}]\}^2 + \{\mathbb{E}[t \mid \mathbf{x}] - t\}^2 + 2\,\{y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}]\}\{\mathbb{E}[t \mid \mathbf{x}] - t\}$. Substituting into the loss function, the cross-term vanishes and we end up with $\mathbb{E}[L] = \int \{y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}]\}^2\, p(\mathbf{x})\, d\mathbf{x} + \int \mathrm{var}[t \mid \mathbf{x}]\, p(\mathbf{x})\, d\mathbf{x}$. The first term is minimized by the conditional mean, i.e. the optimal least-squares predictor is given by the conditional mean; the second term is the intrinsic variability of the target data, the irreducible minimum value of the loss function.

24 Other Loss Functions. The squared loss is not the only possible choice; it is a poor choice when the conditional distribution $p(t \mid \mathbf{x})$ is multimodal. Simple generalization: the Minkowski loss $L_q = |y(\mathbf{x}) - t|^q$ with expectation $\mathbb{E}[L_q] = \iint |y(\mathbf{x}) - t|^q\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$. The minimum of $\mathbb{E}[L_q]$ is given by the conditional mean for q = 2, the conditional median for q = 1, and the conditional mode for q → 0.
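A quick numerical check (an illustrative sketch with assumed bimodal data, not from the slides) that the minimizer of the empirical Minkowski loss is the sample mean for q = 2 and the sample median for q = 1:

```python
import numpy as np

rng = np.random.default_rng(3)
# A skewed, bimodal sample of target values t for a fixed input x.
t = np.concatenate([rng.normal(0.0, 0.5, 300), rng.normal(4.0, 0.5, 100)])

def minkowski_minimizer(t, q, grid=None):
    """Brute-force minimizer of the empirical E[|y - t|^q] over candidate predictions y."""
    if grid is None:
        grid = np.linspace(t.min(), t.max(), 2001)
    losses = np.mean(np.abs(grid[:, None] - t[None, :]) ** q, axis=1)
    return grid[np.argmin(losses)]

print("q = 2 minimizer:", minkowski_minimizer(t, 2), " (mean   =", t.mean(), ")")
print("q = 1 minimizer:", minkowski_minimizer(t, 1), " (median =", np.median(t), ")")
```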

25 Minkowski Loss Functions 25 Image source: C.M. Bishop, 2006

26 Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 26

27 Linear Basis Function Models. Generally, we consider models of the form $y(\mathbf{x}, \mathbf{w}) = \sum_{j} w_j\, \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$, where the $\phi_j(\mathbf{x})$ are known as basis functions. Typically, $\phi_0(\mathbf{x}) = 1$, so that $w_0$ acts as a bias. In the simplest case, we use linear basis functions: $\phi_d(\mathbf{x}) = x_d$. Let's take a look at some other possible basis functions... Slide adapted from C.M. Bishop, 2006

28 Linear Basis Function Models (2). Polynomial basis functions: $\phi_j(x) = x^j$. Properties: global, i.e. a small change in x affects all basis functions. Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

29 Linear Basis Function Models (3). Gaussian basis functions: $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$. Properties: local, i.e. a small change in x affects only nearby basis functions. $\mu_j$ and s control location and scale (width). Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

30 Linear Basis Function Models (4). Sigmoid basis functions: $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$. Properties: local, i.e. a small change in x affects only nearby basis functions. $\mu_j$ and s control location and scale (slope). Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006
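The three families of basis functions are easy to write down in code. The sketch below (with illustrative centers μ_j and width s; not part of the slides) mirrors the definitions above:

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x**j, j = 0..M. Global: changing x affects every feature."""
    return np.vander(np.atleast_1d(x), N=M + 1, increasing=True)

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)). Local: only nearby features react."""
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, centers, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid sigma(a)."""
    x = np.atleast_1d(x)
    a = (x[:, None] - centers[None, :]) / s
    return 1.0 / (1.0 + np.exp(-a))

centers = np.linspace(-1, 1, 5)   # illustrative basis function centers
x = np.array([-0.5, 0.0, 0.5])
print(polynomial_basis(x, 3).shape,
      gaussian_basis(x, centers, 0.2).shape,
      sigmoid_basis(x, centers, 0.2).shape)
```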

31 Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 31

32 Multiple Outputs. Multiple output formulation: so far we have only considered the case of a single target variable t. We may wish to predict K > 1 target variables, collected in a vector $\mathbf{t}$. We can write this in matrix form as $\mathbf{y}(\mathbf{x}, W) = W^T \boldsymbol{\phi}(\mathbf{x})$, where $W$ is an $M \times K$ parameter matrix and $\boldsymbol{\phi}(\mathbf{x})$ is the vector of basis functions.

33 Multiple Outputs (2). Analogously to the single-output case we have $p(\mathbf{t} \mid \mathbf{x}, W, \beta) = \mathcal{N}\!\left(\mathbf{t} \mid W^T \boldsymbol{\phi}(\mathbf{x}), \beta^{-1} \mathbf{I}\right)$. Given observed inputs $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and targets $T = [\mathbf{t}_1, \ldots, \mathbf{t}_N]^T$, we obtain the log-likelihood function $\ln p(T \mid X, W, \beta) = \frac{NK}{2} \ln\!\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2} \sum_{n=1}^{N} \left\| \mathbf{t}_n - W^T \boldsymbol{\phi}(\mathbf{x}_n) \right\|^2$. Slide adapted from C.M. Bishop, 2006

34 Multiple Outputs (3). Maximizing with respect to W, we obtain $W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T$. If we consider a single target variable $t_k$, we see that $\mathbf{w}_k = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{t}_k$, where $\mathbf{t}_k = [t_{1k}, \ldots, t_{Nk}]^T$, which is identical with the single-output case. Slide adapted from C.M. Bishop, 2006
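In code, the multiple-output solution is the same least-squares solve with a target matrix T in place of the target vector t. A minimal sketch with assumed toy data (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 50, 3                              # N samples, K target variables
x = rng.uniform(0, 1, size=N)
Phi = np.vander(x, N=4, increasing=True)   # shared design matrix, M = 4 basis functions

# Stack the K targets as columns of T; each column is one output variable.
T = np.column_stack([np.sin(2 * np.pi * x),
                     np.cos(2 * np.pi * x),
                     x ** 2]) + rng.normal(scale=0.1, size=(N, K))

# W_ML = (Phi^T Phi)^{-1} Phi^T T -- one pseudo-inverse shared by all outputs.
W_ml, *_ = np.linalg.lstsq(Phi, T, rcond=None)
print(W_ml.shape)   # (M, K): column k is the single-output solution for target t_k
```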

35 Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 35

36 Sequential Learning. Up to now, we have mainly considered batch methods, where all data is used at the same time. Instead, we can also consider data items one at a time (a.k.a. online learning). Stochastic (sequential) gradient descent: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\, \nabla E_n$, which for the sum-of-squares error becomes $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \left( t_n - \mathbf{w}^{(\tau)T} \boldsymbol{\phi}(\mathbf{x}_n) \right) \boldsymbol{\phi}(\mathbf{x}_n)$. This is known as the least-mean-squares (LMS) algorithm. Issue: how to choose the learning rate $\eta$? Slide adapted from C.M. Bishop, 2006
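A minimal LMS sketch (assumed toy data, polynomial basis, and an illustrative learning rate η; not from the slides) implementing the sequential update and comparing it with the batch solution:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=200)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)   # polynomial basis, phi_0 = 1

eta = 0.05                                 # learning rate (needs tuning in practice)
w = np.zeros(Phi.shape[1])

for epoch in range(50):
    for n in rng.permutation(len(x)):      # visit data items one at a time
        phi_n = Phi[n]
        error = t[n] - w @ phi_n
        w = w + eta * error * phi_n        # least-mean-squares (LMS) update

w_batch, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("sequential w:", w)
print("batch w:     ", w_batch)
```

With a constant learning rate the iterate only hovers near the batch solution; a decaying η would be needed for exact convergence, which is one reason the choice of learning rate is an issue.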

37 Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 37

38 Regularization Revisited. Consider the error function $E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda\, E_W(\mathbf{w})$ (data term + regularization term). With the sum-of-squares error function and a quadratic regularizer, we get $\frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$, which is minimized by $\mathbf{w} = \left( \lambda \mathbf{I} + \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{t}$. Here $\lambda$ is called the regularization coefficient. Slide adapted from C.M. Bishop, 2006

39 Regularized Least-Squares. Let's look at more general regularizers ("$L_q$ norms"): $\frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$. The case q = 1 is the Lasso, q = 2 is Ridge Regression. Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

40 Recall: Lagrange Multipliers

41 Regularized Least-Squares. We want to minimize $E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$. This is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint $\sum_{j=1}^{M} |w_j|^q \leq \eta$ (for some suitably chosen $\eta$).

42 Regularized Least-Squares. Effect: sparsity for q ≤ 1; the minimization tends to set many coefficients exactly to zero. Why is this good? Why don't we always do it, then? Any problems? Image source: C.M. Bishop, 2006

43 The Lasso. Consider the following regressor: $\hat{\mathbf{w}}_{\text{lasso}} = \arg\min_{\mathbf{w}} \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \right\}^2$ subject to $\sum_{j=1}^{M} |w_j| \leq \eta$. This formulation is known as the Lasso. Properties: L1 regularization; the solution will be sparse (only few coefficients will be non-zero); the L1 penalty makes the problem non-linear, and there is no closed-form solution; we need to solve a quadratic programming problem. However, efficient algorithms are available with the same computational cost as for ridge regression. Image source: C.M. Bishop, 2006
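There is no closed form, but coordinate descent with soft-thresholding is one standard way to solve this problem. The sketch below (written from scratch with assumed synthetic data; it is not the specific algorithm referenced on the slide) illustrates the sparsity of the L1 solution compared with ridge:

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_coordinate_descent(Phi, t, lam, n_iters=200):
    """Minimize 0.5 * ||t - Phi w||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    w = np.zeros(Phi.shape[1])
    z = np.sum(Phi ** 2, axis=0)                    # per-coordinate curvature ||phi_j||^2
    for _ in range(n_iters):
        for j in range(len(w)):
            r_j = t - Phi @ w + Phi[:, j] * w[j]    # residual with coordinate j removed
            rho = Phi[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / z[j]
    return w

rng = np.random.default_rng(6)
N, M = 100, 10
Phi = rng.normal(size=(N, M))
w_true = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 1.5, 0, 0])   # only 3 relevant coefficients
t = Phi @ w_true + rng.normal(scale=0.5, size=N)

w_lasso = lasso_coordinate_descent(Phi, t, lam=10.0)
w_ridge = np.linalg.solve(Phi.T @ Phi + 10.0 * np.eye(M), Phi.T @ t)
print("lasso:", np.round(w_lasso, 2))
print("ridge:", np.round(w_ridge, 2))
```

With these assumed settings, the lasso typically drives the irrelevant coefficients exactly to zero, while ridge only shrinks them toward zero.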

44 Lasso as Bayes Estimation. Interpretation as Bayes estimation: we can think of $|w_j|^q$ as the (negative) log-prior density for $w_j$. Prior for the Lasso (q = 1): Laplacian distribution, $p(w_j) = \frac{1}{2\tau} \exp\!\left(-|w_j| / \tau\right)$ with $\tau = 1 / \lambda$. Image source: Friedman, Hastie, Tibshirani, 2009

45 Analysis. Equicontours of the prior distribution. Analysis: for q ≤ 1, the prior is not uniform in direction, but concentrates more mass on the coordinate directions. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convexity makes the optimization problem more difficult. In the limit q → 0, the regularization term simply counts the number of non-zero coefficients. This is known as Best Subset Selection. Image source: Friedman, Hastie, Tibshirani, 2009

46 Discussion. Bayesian analysis: Lasso, Ridge regression and Best Subset Selection are Bayes estimates with different priors. However, they are derived as maximizers of the posterior; ideally, one should use the posterior mean as the Bayes estimate! The Ridge regression solution is also the posterior mean, but Lasso and Best Subset Selection are not. We might also try using other values of q besides 0, 1, 2. However, experience shows that this is not worth the effort: values of q ∈ (1, 2) are a compromise between lasso and ridge, but $|w_j|^q$ with q > 1 is differentiable at 0 and therefore loses the ability of the lasso to set coefficients exactly to zero.

47 Topics of This Lecture Recap: Probabilistic View on Regression Properties of Linear Regression Loss functions for regression Basis functions Multiple Outputs Sequential Estimation Regularization revisited Regularized Least-squares The Lasso Discussion Bias-Variance Decomposition 47

48 Bias-Variance Decomposition. Recall the expected squared loss $\mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2\, p(\mathbf{x})\, d\mathbf{x} + \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$, where $h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, dt$. The second term of $\mathbb{E}[L]$ corresponds to the noise inherent in the random variable t. What about the first term? Slide adapted from C.M. Bishop, 2006

49 Bias-Variance Decomposition. Suppose we were given multiple data sets, each of size N. Any particular data set D will give a particular function $y(\mathbf{x}; D)$. We then have $\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2 = \{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2 + 2\,\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\} + \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2$. Taking the expectation over D yields $\mathbb{E}_D\!\left[\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2\right] = \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 + \mathbb{E}_D\!\left[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\right]$. Slide adapted from C.M. Bishop, 2006

50 Bias-Variance Decomposition. Thus we can write: expected loss = (bias)² + variance + noise, where $(\text{bias})^2 = \int \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2\, p(\mathbf{x})\, d\mathbf{x}$, $\text{variance} = \int \mathbb{E}_D\!\left[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\right] p(\mathbf{x})\, d\mathbf{x}$, and $\text{noise} = \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt$. Slide adapted from C.M. Bishop, 2006
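These three terms can be estimated numerically by repeating the fit over many data sets, as in the example on the following slides. A minimal sketch, assuming the sinusoidal toy problem, Gaussian basis functions, and ridge fits with an illustrative λ (all values chosen for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
L, N = 25, 25                        # 25 data sets of size N, as in the example
centers, s, lam = np.linspace(0, 1, 24), 0.1, 0.1

def phi(x):
    """Gaussian basis functions."""
    return np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)       # true regression function h(x) = E[t|x]

predictions = []
for _ in range(L):
    x = rng.uniform(0, 1, size=N)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
    Phi = phi(x)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
    predictions.append(phi(x_test) @ w)
predictions = np.array(predictions)  # shape (L, len(x_test))

y_bar = predictions.mean(axis=0)                 # average prediction over data sets
bias2 = np.mean((y_bar - h) ** 2)                # (bias)^2, averaged over x
variance = np.mean(predictions.var(axis=0))      # variance, averaged over x
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```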

51 Bias-Variance Decomposition. Example: 25 data sets from the sinusoidal, varying the degree of regularization λ. Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

52 Bias-Variance Decomposition. Example: 25 data sets from the sinusoidal, varying the degree of regularization λ. Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

53 Bias-Variance Decomposition. Example: 25 data sets from the sinusoidal, varying the degree of regularization λ. Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

54 The Bias-Variance Trade-Off. Result from these plots: an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance. We can compute an estimate for the generalization capability this way (magenta curve)! Can you see where the problem is with this? The computation is based on averages w.r.t. ensembles of data sets, which unfortunately makes it of little practical value. Slide adapted from C.M. Bishop, 2006 Image source: C.M. Bishop, 2006

55 References and Further Reading. More information on linear regression, including a discussion of regularization, can be found in the corresponding chapters of the Bishop book: Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. Additional information on the Lasso, including efficient algorithms to solve it, can be found in Chapter 3.4 of the Hastie book: T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd edition, Springer, 2009.
