Cheng Soon Ong & Christian Walder. Canberra, February – June 2018


1 Cheng Soon Ong & Christian Walder. Research Group and College of Engineering and Computer Science, Canberra, February – June 2018. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)

2 Part V: Linear Regression 1

3 Linear Regression. Data: N = 10 observations x = (x_1, ..., x_N)^T and t = (t_1, ..., t_N)^T, with x_i ∈ R and t_i ∈ R for i = 1, ..., N. Questions: What predictor y(x, w)? What performance measure? What optimal solution w? Recall: projection, inverse, eigenvalue decomposition. (Figure: training targets t plotted against inputs x.)

4 Probabilities, Losses: Gaussian distribution, Bayes' rule, expected loss, cross validation.

5 Maximum Likelihood. Consider the linear regression problem, where we have random variables x_n and t_n. We assume a conditional model t_n | x_n, and propose a distribution parameterised by θ: t_n | x_n ~ density(θ). For a given θ, the density defines the probability of observing t_n | x_n. We are interested in finding the θ that maximises the probability (called the likelihood) of the data.

6 Given a training data set of N observations {x_n} and target values t_n, the goal is to learn to predict the value of one or more target values t given a new value of the input x. Example: polynomial curve fitting (see Introduction). (Figure: training points t plotted against x.)

7 Supervised Learning. Training phase: training data x and training targets t are used to fit the model with adjustable parameter w. Test phase: fix the most appropriate w*; the model with fixed parameter w* maps test data x to predicted test targets t.

8 Why Linear? Analytic solution when using the least squares loss. Well understood statistical behaviour. Efficient algorithms exist for convex losses and regularisers. But what if the relationship is non-linear?

9 Linear combination of fixed nonlinear basis functions: y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x), with parameters w = (w_0, ..., w_{M-1})^T and basis functions φ(x) = (φ_0(x), ..., φ_{M-1}(x))^T. By convention φ_0(x) = 1, so that w_0 is the bias parameter.

10 Polynomial Basis Functions. Scalar input variable x: φ_j(x) = x^j. Limitation: polynomials are global functions of the input variable x. Extension: split the input space into regions and fit a different polynomial to each region (spline functions).

11 Gaussian Basis Functions. Scalar input variable x: φ_j(x) = exp{ -(x - µ_j)^2 / (2s^2) }. Not a probability distribution: no normalisation is required, as the scale is taken care of by the model parameters w.

12 Sigmoidal Basis Functions. Scalar input variable x: φ_j(x) = σ((x - µ_j)/s), where σ(a) is the logistic sigmoid function defined by σ(a) = 1/(1 + exp(-a)). σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(2a) - 1.
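To make these basis functions concrete, here is a minimal NumPy sketch (the function names, centres µ_j, and width s are illustrative choices, not from the slides); each function returns one row φ(x_n)^T per input, so y(x, w) = w^T φ(x) becomes a matrix-vector product:

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x**j for j = 0, ..., M-1; phi_0(x) = 1 is the bias term."""
    return np.vstack([x**j for j in range(M)]).T              # shape (N, M)

def gaussian_basis(x, centres, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), preceded by phi_0(x) = 1."""
    phi = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))
    return np.hstack([np.ones((len(x), 1)), phi])

def sigmoidal_basis(x, centres, s):
    """phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))."""
    phi = 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / s))
    return np.hstack([np.ones((len(x), 1)), phi])

x = np.linspace(0, 1, 10)                                     # N = 10 scalar inputs
Phi = gaussian_basis(x, centres=np.linspace(0, 1, 5), s=0.2)  # N x M design matrix
w = np.zeros(Phi.shape[1])                                    # w = (w_0, ..., w_{M-1})^T
y = Phi @ w                                                   # y(x_n, w) = w^T phi(x_n)
```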

13 Other Basis Functions. Fourier basis: each basis function represents a specific frequency and has infinite spatial extent. Wavelets: localised in both space and frequency (also mutually orthogonal, to simplify application). Splines: piecewise polynomials restricted to regions of the input space, with additional constraints where pieces meet (e.g. smoothness conditions on the derivatives). (Figure: linear, quadratic, cubic, and quartic splines approximating the points {(0, 0), (1, 1), (2, 1), (3, 0), (4, 2), (5, 1)}.)

14 Maximum Likelihood and Least Squares. No special assumption about the basis functions φ_j(x); in the simplest case, one can think of φ_j(x) = x^j, or φ(x) = x. Assume the target t is given by t = y(x, w) + ε, with deterministic part y(x, w) and noise ε, where ε is a zero-mean Gaussian random variable with precision (inverse variance) β. Thus p(t | x, w, β) = N(t | y(x, w), β^{-1}). (Figure: the conditional density p(t | x_0), centred at y(x_0).)

15 The likelihood of one target t given the data x was p(t | x, w, β) = N(t | y(x, w), β^{-1}). Now consider a set of inputs X with corresponding target values t. Assume the data are independent and identically distributed (i.i.d.), meaning they are drawn independently and from the same distribution. The likelihood of the targets t is then p(t | X, w, β) = Π_{n=1}^N N(t_n | y(x_n, w), β^{-1}) = Π_{n=1}^N N(t_n | w^T φ(x_n), β^{-1}). From now on we drop the conditioning variable X from the notation, as in supervised learning we do not seek to model the distribution of the input data.

16 Consider the logarithm of the likelihood p(t | w, β) (the logarithm is a monotone function!):
ln p(t | w, β) = Σ_{n=1}^N ln N(t_n | w^T φ(x_n), β^{-1})
             = Σ_{n=1}^N ln [ sqrt(β/(2π)) exp{ -(β/2)(t_n - w^T φ(x_n))^2 } ]
             = (N/2) ln β - (N/2) ln(2π) - β E_D(w),
where the sum-of-squares error function is E_D(w) = (1/2) Σ_{n=1}^N {t_n - w^T φ(x_n)}^2. Hence arg max_w ln p(t | w, β) = arg min_w E_D(w).
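As a numeric sanity check of this identity (a sketch with synthetic data; all names are mine), one can verify that summing the Gaussian log densities equals (N/2) ln β - (N/2) ln 2π - β E_D(w):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, beta = 50, 4, 2.0
Phi = rng.normal(size=(N, M))                    # an arbitrary design matrix
w = rng.normal(size=M)
t = Phi @ w + rng.normal(scale=beta**-0.5, size=N)

E_D = 0.5 * np.sum((t - Phi @ w)**2)             # sum-of-squares error E_D(w)
log_lik = N/2 * np.log(beta) - N/2 * np.log(2*np.pi) - beta * E_D

# The same quantity, summing the Gaussian log densities term by term:
log_lik_direct = np.sum(-0.5 * np.log(2*np.pi/beta)
                        - 0.5 * beta * (t - Phi @ w)**2)
assert np.isclose(log_lik, log_lik_direct)
```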

17 Goal: find a more compact representation. Rewrite the error function as E_D(w) = (1/2) Σ_{n=1}^N {t_n - w^T φ(x_n)}^2 = (1/2)(t - Φw)^T (t - Φw), where t = (t_1, ..., t_N)^T and Φ is the N × M design matrix
Φ = [ φ_0(x_1) φ_1(x_1) ... φ_{M-1}(x_1)
      φ_0(x_2) φ_1(x_2) ... φ_{M-1}(x_2)
      ...
      φ_0(x_N) φ_1(x_N) ... φ_{M-1}(x_N) ].

18 The log likelihood is now ln p(t | w, β) = (N/2) ln β - (N/2) ln(2π) - (β/2)(t - Φw)^T (t - Φw). Find the critical points of ln p(t | w, β): the gradient with respect to w is ∇_w ln p(t | w, β) = β Φ^T (t - Φw). Setting the gradient to zero gives 0 = Φ^T t - Φ^T Φ w, which results in w_ML = (Φ^T Φ)^{-1} Φ^T t = Φ† t, where Φ† is the Moore-Penrose pseudo-inverse of the matrix Φ.
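In code, w_ML can be computed directly; a sketch with placeholder data (np.linalg.lstsq or the pseudo-inverse is numerically preferable to forming (Φ^T Φ)^{-1} explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))                   # placeholder design matrix
t = rng.normal(size=50)                          # placeholder targets

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # normal equations
w_lstsq = np.linalg.lstsq(Phi, t, rcond=None)[0] # stable least-squares solver
w_pinv = np.linalg.pinv(Phi) @ t                 # Moore-Penrose pseudo-inverse
assert np.allclose(w_ml, w_lstsq) and np.allclose(w_ml, w_pinv)
```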

19 The log likelihood at the optimal w_ML is ln p(t | w_ML, β) = (N/2) ln β - (N/2) ln(2π) - (β/2)(t - Φ w_ML)^T (t - Φ w_ML). Finding the critical point with respect to β, ∂ ln p(t | w_ML, β)/∂β = 0, results in 1/β_ML = (1/N)(t - Φ w_ML)^T (t - Φ w_ML). Note: we can first find the maximum likelihood solution for w, as this does not depend on β; then we can use w_ML to find the maximum likelihood solution for β. Could we have optimised with respect to β first, and then with respect to w?
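The two-stage estimate is straightforward in code; a small sketch with synthetic data (the true weights and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 4))
t = Phi @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]    # step 1: w_ML (independent of beta)
beta_ml = 1.0 / np.mean((t - Phi @ w_ml)**2)     # step 2: 1/beta_ML = mean squared residual
print(beta_ml)                                   # close to 1 / 0.1**2 = 100 here
```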

20 Target vector t = (t_1, ..., t_N)^T ∈ R^N. The basis vectors ϕ_j = (φ_j(x_1), ..., φ_j(x_N))^T ∈ R^N (the columns of Φ) span a subspace S. Find w such that y = (y(x_1, w), ..., y(x_N, w))^T ∈ S is closest to t. (Figure: t and its projection y onto the subspace S spanned by ϕ_1 and ϕ_2.)

21 Stochastic Gradient Descent. For large data sets, calculating the maximum likelihood parameters w_ML and β_ML may be costly, and for online applications the data are never all in memory at once. Instead, use a sequential (online) algorithm. If the error function is a sum over data points, E = Σ_n E_n, then: 1. initialise w^(0) to some starting value; 2. update the parameter vector at iteration τ + 1 by w^(τ+1) = w^(τ) - η ∇E_n, where E_n is the error function after presenting the nth data point, and η is the learning rate.

22 Stochastic Gradient Descent. For the sum-of-squares error function, stochastic gradient descent results in w^(τ+1) = w^(τ) + η (t_n - w^(τ)T φ(x_n)) φ(x_n). The value of the learning rate must be chosen carefully: too large a learning rate may prevent the algorithm from converging, while too small a learning rate makes it follow the data unnecessarily slowly.
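A minimal sequential least-squares (LMS-style) loop implementing this update; the data, learning rate η, and number of passes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, eta = 100, 3, 0.05
Phi = rng.normal(size=(N, M))
t = Phi @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=N)

w = np.zeros(M)                                  # w^(0): starting value
for epoch in range(50):
    for n in rng.permutation(N):                 # present points in random order
        error = t[n] - w @ Phi[n]                # t_n - w^T phi(x_n)
        w = w + eta * error * Phi[n]             # the update rule above

# With a suitable eta, w approaches the batch least-squares solution:
print(w, np.linalg.lstsq(Phi, t, rcond=None)[0])
```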

23 Add regularisation in order to prevent overfitting: minimise E_D(w) + λ E_W(w), with regularisation coefficient λ. For the simple quadratic regulariser E_W(w) = (1/2) w^T w, the solution is w = (λI + Φ^T Φ)^{-1} Φ^T t.
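A sketch of the regularised solution (lam stands for λ; the data are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 10))                  # placeholder design matrix
t = rng.normal(size=50)
lam = 0.1                                        # regularisation coefficient lambda

M = Phi.shape[1]
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```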

24 A more general regulariser: E_W(w) = (1/2) Σ_{j=1}^M |w_j|^q. The choice q = 1 (the lasso) leads to a sparse model if λ is large enough. (Figure: contours of the regulariser for q = 0.5, q = 1, q = 2, and q = 4.)
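There is no closed-form solution for q = 1, but standard solvers handle it; for instance, assuming scikit-learn is available (note that its Lasso scales the data-fit term by 1/(2N), so its alpha is not numerically identical to the λ above), one can observe the sparsity directly:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 20))                 # 20 basis functions
t = Phi[:, 0] - 2 * Phi[:, 3] + 0.1 * rng.normal(size=100)   # only 2 matter

w_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(Phi, t).coef_
print(np.nonzero(w_lasso)[0])                    # most coefficients are exactly 0
```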

25 Comparison of Quadratic and Lasso Regularisers. Assume a sufficiently large regularisation coefficient λ. Quadratic regulariser: Σ_{j=1}^M w_j^2. Lasso regulariser: Σ_{j=1}^M |w_j|. (Figure: error contours and constraint regions in the (w_1, w_2) plane; the corners of the lasso constraint region produce a sparse solution.)

26 More than one target variable per data point: y becomes a vector instead of a scalar. Each dimension could be treated with a different set of basis functions (which may be necessary if the data in the different target dimensions represent very different types of information). Here we restrict ourselves to the SAME basis functions for all dimensions: y(x, w) = W^T φ(x), where y is a K-dimensional column vector, W is an M × K matrix of model parameters, and φ(x) = (φ_0(x), ..., φ_{M-1}(x))^T with φ_0(x) = 1, as before. Define the target matrix T containing the target vector t_n^T in the nth row.

27 Suppose the conditional distribution of the target vector is an isotropic Gaussian of the form p(t | x, W, β) = N(t | W^T φ(x), β^{-1} I). The log likelihood is then
ln p(T | X, W, β) = Σ_{n=1}^N ln N(t_n | W^T φ(x_n), β^{-1} I)
                 = (NK/2) ln(β/(2π)) - (β/2) Σ_{n=1}^N ||t_n - W^T φ(x_n)||^2.

28 Maximisation with respect to W results in W_ML = (Φ^T Φ)^{-1} Φ^T T. For each target variable t_k, we get w_k = (Φ^T Φ)^{-1} Φ^T t_k = Φ† t_k: the solution decouples between the different target variables. This holds also for a general Gaussian noise distribution with an arbitrary covariance matrix. Why? W defines the mean of the Gaussian noise distribution, and the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
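A sketch verifying the decoupling numerically (placeholder data): solving for all K targets at once gives the same columns as solving each target separately.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 50, 4, 3
Phi = rng.normal(size=(N, M))
T = rng.normal(size=(N, K))                      # one target vector t_n^T per row

W_ml = np.linalg.pinv(Phi) @ T                   # W_ML = Phi† T, shape (M, K)
for k in range(K):                               # column-wise solutions coincide
    assert np.allclose(W_ml[:, k], np.linalg.pinv(Phi) @ T[:, k])
```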

29 Over-fitting results from a large number of basis functions and a relatively small training set. Regularisation can prevent overfitting, but how do we find the correct value of the regularisation constant λ? The frequentist viewpoint of model complexity is the bias-variance trade-off.

30 Choose an estimator y(x) to estimate the target value t for each input x, and a loss function L(t, y(x)) which measures the difference between the target t and the estimate y(x). The expected loss is then E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt. Common choice: the squared loss L(t, y(x)) = {y(x) - t}^2, for which E[L] = ∫∫ {y(x) - t}^2 p(x, t) dx dt.

31 The expected loss for the squared loss function, E[L] = ∫∫ {y(x) - t}^2 p(x, t) dx dt, is minimised by choosing the regression function y(x) = ∫ t p(x, t) dt / p(x) = ∫ t p(t | x) dt = E_t[t | x] (use calculus of variations to derive this result).
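A quick empirical illustration of this result (a sketch, with all numbers invented): among constant predictions c of t at a fixed x, the sample mean minimises the average squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(loc=1.5, scale=0.5, size=100_000) # samples of t | x for one fixed x

cs = np.linspace(0.0, 3.0, 301)                  # candidate constant predictions
risk = [np.mean((c - t)**2) for c in cs]         # empirical expected squared loss
print(cs[np.argmin(risk)], t.mean())             # minimiser is (approx.) E[t | x]
```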

32 The regression function which minimises the expected squared loss is given by the mean of the conditional distribution p(t | x). (Figure: y(x_0) = E[t | x_0] shown as the mean of the conditional density p(t | x_0).)

33 Analyse the expected loss E[L] = ∫∫ {y(x) - t}^2 p(x, t) dx dt. Rewrite the squared loss as
{y(x) - t}^2 = {y(x) - E[t|x] + E[t|x] - t}^2
            = {y(x) - E[t|x]}^2 + {E[t|x] - t}^2 + 2 {y(x) - E[t|x]} {E[t|x] - t}.
Claim: ∫∫ {y(x) - E[t|x]} {E[t|x] - t} p(x, t) dx dt = 0.

34 Claim: ∫∫ {y(x) - E[t|x]} {E[t|x] - t} p(x, t) dx dt = 0. Separate the factors depending on t from those depending on x: ∫ {y(x) - E[t|x]} ( ∫ {E[t|x] - t} p(x, t) dt ) dx. Then calculate the inner integral over t: ∫ {E[t|x] - t} p(x, t) dt = E[t|x] p(x) - ∫ t p(x, t) dt = E[t|x] p(x) - p(x) E[t|x] = 0.

35 The expected loss is now E[L] = ∫ {y(x) - E[t|x]}^2 p(x) dx + ∫ var[t | x] p(x) dx. Minimise the first term by choosing an appropriate y(x). The second term represents the intrinsic variability of the target data and can be regarded as noise; being independent of the choice of y(x), it cannot be reduced by learning a better y(x).

36 Consider now the dependency on the data set D; the prediction function is now y(x; D). Consider again the squared loss, for which the optimal prediction is given by the conditional expectation h(x) = E[t | x] = ∫ t p(t | x) dt. BUT: we cannot know h(x) exactly, as that would require an infinite number of training data points. Instead, evaluate the performance of a learning algorithm by taking the expectation E_D[L] over all data sets D.

37 Taking the expectation over all data sets D: E_D[E[L]] = ∫ E_D[{y(x; D) - h(x)}^2] p(x) dx + ∫∫ {h(x) - t}^2 p(x, t) dx dt. Again, add and subtract the expectation E_D[y(x; D)], {y(x; D) - h(x)}^2 = {y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x)}^2, and show that the mixed term vanishes under the expectation E_D[·].

38 The expected loss E_D[L] over all data sets D decomposes as expected loss = (bias)^2 + variance + noise, where
(bias)^2 = ∫ {E_D[y(x; D)] - h(x)}^2 p(x) dx,
variance = ∫ E_D[{y(x; D) - E_D[y(x; D)]}^2] p(x) dx,
noise = ∫∫ {h(x) - t}^2 p(x, t) dx dt.
Variance: how sensitive is the model to small changes in the training set? (How much do the solutions for individual data sets vary around their average?) Squared bias: how accurate is the model across different training sets? (How much does the average prediction over all data sets differ from the desired regression function?)
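These terms can be estimated by simulation, in the spirit of the sinusoidal example on the following slides; this sketch (my own setup: a noisy sinusoid, Gaussian-basis ridge regression, L = 100 data sets) estimates (bias)^2 and variance on a test grid:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, s, lam = 100, 25, 0.1, np.exp(2.6)         # data sets, points each, basis width, lambda
x_test = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_test)                   # true regression function h(x)
centres = np.linspace(0, 1, 24)

def design(x):
    phi = np.exp(-(x[:, None] - centres[None, :])**2 / (2 * s**2))
    return np.hstack([np.ones((len(x), 1)), phi])

preds = np.empty((L, len(x_test)))
for i in range(L):                               # fit the model to L independent data sets
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
    Phi = design(x)
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    preds[i] = design(x_test) @ w

avg = preds.mean(axis=0)                         # E_D[y(x; D)]
bias2 = np.mean((avg - h)**2)                    # (bias)^2, averaged over x
variance = np.mean(preds.var(axis=0))            # variance, averaged over x
print(bias2, variance)                           # large lam: high bias, low variance
```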

39 Simple models have low variance and high bias. (Figure, large ln λ: left, the result of fitting the model to 100 data sets, only 25 shown; right, the average of the 100 fits in red and the sinusoidal function from which the data were generated in green.)

40 Dependence of bias and variance on the model complexity. (Figure, intermediate ln λ: left, the result of fitting the model to 100 data sets, only 25 shown; right, the average of the 100 fits in red and the sinusoidal function from which the data were generated in green.)

41 Complex models have high variance and low bias. (Figure, small ln λ: left, the result of fitting the model to 100 data sets, only 25 shown; right, the average of the 100 fits in red and the sinusoidal function from which the data were generated in green.)

42 Squared bias, variance, their sum, and test error. The minimum of (bias)^2 + variance occurs close to the value of ln λ that gives the minimum test error. (Figure: (bias)^2, variance, (bias)^2 + variance, and the test error plotted against ln λ.)

43 Trade-off between bias and variance: simple models have low variance and high bias; complex models have high variance and low bias. The sum of (bias)^2 and variance has a minimum at a certain model complexity. The expected loss E_D[L] over all data sets D is expected loss = (bias)^2 + variance + noise. The noise comes from the data and cannot be removed from the expected loss. To analyse the bias-variance decomposition, many data sets are needed, which are not always available.
