Bayesian Linear Regression. Sargur Srihari


1 Bayesian Linear Regression Sargur Srihari

2 Topics in Bayesian Regression: Recall Max Likelihood Linear Regression; Parameter Distribution; Predictive Distribution; Equivalent Kernel

3 Linear Regression: model complexity M. Polynomial regression y(x,w) = w_0 + w_1 x + w_2 x² + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j. [Figure: red lines are best fits with M = 0, 1, 3, 9 and N = 10 data points; low M gives poor representations of sin(2πx), an intermediate M gives the best fit to sin(2πx), and M = 9 over-fits.]

4 Max Likelihood Regression. Input vector x, basis functions {φ_1(x),..,φ_M(x)}: y(x,w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x). Radial basis fns: φ_j(x) = exp{ -½ (x - μ_j)^T Σ^{-1} (x - μ_j) }. Max Likelihood objective with N examples {x_1,..,x_N} (equivalent to the Mean Squared Error objective): E(w) = ½ Σ_{n=1}^N { t_n - w^T φ(x_n) }². Closed-form ML solution: w_ML = (Φ^T Φ)^{-1} Φ^T t, where Φ is the N×M design matrix with Φ_{nj} = φ_j(x_n) and (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudo-inverse. Regularized MSE (λ is the regularization coefficient): E(w) = ½ Σ_{n=1}^N { t_n - w^T φ(x_n) }² + (λ/2) w^T w, with regularized solution w = (λI + Φ^T Φ)^{-1} Φ^T t. Gradient descent: w^{(τ+1)} = w^{(τ)} - η ∇E, with ∇E = -Σ_{n=1}^N { t_n - w^{(τ)T} φ(x_n) } φ(x_n); the regularized version adds λ w^{(τ)}.
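A minimal sketch of the two closed-form solutions above, using NumPy with a Gaussian radial-basis design matrix; the basis centers, width, toy sin(2πx) data, and regularization value below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Design matrix Phi with Phi[n, j] = phi_j(x_n); column 0 is a constant bias."""
    rbf = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / s) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), rbf])

def fit_ml(Phi, t):
    """w_ML = (Phi^T Phi)^{-1} Phi^T t, via least squares (Moore-Penrose pseudo-inverse)."""
    return np.linalg.lstsq(Phi, t, rcond=None)[0]

def fit_regularized(Phi, t, lam):
    """Regularized solution w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Toy data (assumed): noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=25)

Phi = gaussian_basis(x, centers=np.linspace(0, 1, 9), s=0.15)
w_ml = fit_ml(Phi, t)
w_reg = fit_regularized(Phi, t, lam=0.1)
```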

5 Shortcomings of MLE. The M.L.E. of the parameters w does not address the model complexity M (how many basis functions?). Model complexity is controlled by data size: more data allows a better fit without overfitting. Regularization also controls overfit (λ controls the effect): E(w) = E_D(w) + λ E_W(w), where E_D(w) = ½ Σ_{n=1}^N { t_n - w^T φ(x_n) }² and E_W(w) = ½ w^T w. But M and the choice of φ_j are still important. M can be determined by holdout, but that is wasteful of data. Model complexity and over-fitting are better handled using the Bayesian approach.

6 Bayesian Linear Regression. Using Bayes rule, the posterior is proportional to Likelihood × Prior: p(w|t) = p(t|w) p(w) / p(t), where p(t|w) is the likelihood of the observed data and p(w) is the prior distribution over the parameters. We will look at: a normal distribution for the prior p(w); a likelihood p(t|w) that is a product of Gaussians based on the noise model; and conclude that the posterior is also Gaussian.

7 Gaussian Prior over Parameters. Assume a multivariate Gaussian prior for w (which has components w_0,..,w_M): p(w) = N(w | m_0, S_0) with mean m_0 and covariance matrix S_0. If we choose S_0 = α^{-1} I, the variances of the weights are all equal to 1/α and the covariances are zero. [Figure: contours of p(w) in (w_0, w_1) space with zero mean (m_0 = 0) and isotropic covariance (same variances).]

8 Likelihood of Data is Gaussian. Assume a noise precision parameter β: t = y(x,w) + ε, where ε is Gaussian noise, so that p(t|x,w,β) = N(t | y(x,w), β^{-1}). Note that the output t is a scalar. The likelihood of t = {t_1,..,t_N} is then p(t|X,w,β) = Π_{n=1}^N N(t_n | w^T φ(x_n), β^{-1}). This is the probability of the target data t given the parameters w and inputs X = {x_1,..,x_N}. Due to the Gaussian noise, the likelihood p(t|w) is also Gaussian.

9 Posterior Distribution is also Gaussian. Prior: p(w) = N(w | m_0, S_0), i.e., it is Gaussian. The likelihood comes from Gaussian noise: p(t|X,w,β) = Π_{n=1}^N N(t_n | w^T φ(x_n), β^{-1}). It follows that the posterior p(w|t) is also Gaussian. Proof: use the standard result for linear-Gaussian models. If the marginal p(w) and the conditional p(t|w) have Gaussian forms, then the marginal p(t) and the conditional p(w|t) are also Gaussian: let p(w) = N(w | μ, Λ^{-1}) and p(t|w) = N(t | Aw + b, L^{-1}); then the marginal is p(t) = N(t | Aμ + b, L^{-1} + A Λ^{-1} A^T) and the conditional is p(w|t) = N(w | Σ{A^T L (t - b) + Λ μ}, Σ), where Σ = (Λ + A^T L A)^{-1}.

10 Exact form of Posterior Distribution. We have p(w) = N(w | m_0, S_0) and p(t|X,w,β) = Π_{n=1}^N N(t_n | w^T φ(x_n), β^{-1}). The posterior is also Gaussian and can be written directly as p(w|t) = N(w | m_N, S_N), where the mean of the posterior is m_N = S_N (S_0^{-1} m_0 + β Φ^T t) and the covariance of the posterior is given by S_N^{-1} = S_0^{-1} + β Φ^T Φ. Here Φ is the N×M design matrix with Φ_{nj} = φ_j(x_n). [Figure: prior p(w|α) = N(w | 0, α^{-1} I) and posterior in (w_0, w_1) weight space for scalar input x and y(x,w) = w_0 + w_1 x.]

11 Properties of Posterior. 1. Since the posterior p(w|t) = N(w | m_N, S_N) is Gaussian, its mode coincides with its mean; thus the maximum posterior weight is w_MAP = m_N. 2. For an infinitely broad prior S_0 = α^{-1} I with precision α → 0, the mean m_N reduces to the maximum likelihood value, i.e., the mean is the solution vector w_ML = (Φ^T Φ)^{-1} Φ^T t. 3. If N = 0, the posterior reverts to the prior. 4. If data points arrive sequentially, the posterior at any stage acts as the prior distribution for the subsequent data points.

12 Choose a simple Gaussian prior p(w). For y(x,w) = w_0 + w_1 x, take a zero-mean (m_0 = 0), isotropic (same variances) Gaussian prior with a single precision parameter α: p(w|α) = N(w | 0, α^{-1} I). The corresponding posterior distribution is p(w|t) = N(w | m_N, S_N), where m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ. Note: β is the noise precision and α is the precision of the prior over w (its variance is α^{-1}). [Figure: prior in (w_0, w_1) space; with infinite samples the posterior collapses to a point estimate.]
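A short sketch of the posterior mean and covariance under the zero-mean isotropic prior assumed above; `Phi`, `t`, `alpha`, and `beta` are placeholders to be supplied by the caller (for example from the earlier design-matrix sketch).

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior p(w|t) = N(w | m_N, S_N) for prior N(w | 0, alpha^{-1} I)
    and Gaussian noise with precision beta."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # S_N^{-1} = alpha*I + beta*Phi^T Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t                       # m_N = beta * S_N * Phi^T * t
    return m_N, S_N
```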

13 Equivalence to MLE with Regularization. Since p(t|X,w,β) = Π_{n=1}^N N(t_n | w^T φ(x_n), β^{-1}) and p(w|α) = N(w | 0, α^{-1} I), the log of the posterior is ln p(w|t) = -(β/2) Σ_{n=1}^N { t_n - w^T φ(x_n) }² - (α/2) w^T w + const. Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error E(w) = ½ Σ_{n=1}^N { t_n - w^T φ(x_n) }² + (λ/2) w^T w, i.e., with the addition of the quadratic regularization term w^T w with λ = α/β.
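A quick numerical check of this equivalence, using a stand-in random design matrix (an assumption for illustration): the posterior mean (MAP estimate) should match the regularized least-squares solution with λ = α/β.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 5
Phi = rng.normal(size=(N, M))            # stand-in design matrix
w_true = rng.normal(size=M)
beta = 25.0
t = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), size=N)
alpha = 2.0

# Posterior mean (= MAP, since the Gaussian posterior's mode is its mean)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Regularized least squares with lambda = alpha / beta
w_ridge = np.linalg.solve((alpha / beta) * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
assert np.allclose(m_N, w_ridge)
```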

14 Bayesian Linear Regression Example (Straight Line Fit). Single input variable x, single target variable t. The goal is to fit the linear model y(x,w) = w_0 + w_1 x, i.e., to recover w = [w_0, w_1] given the samples. [Figure: scatter of the samples in the (x, t) plane.]

15 Data Generation. Synthetic data generated from f(x,w) = w_0 + w_1 x with parameter values w_0 = -0.3 and w_1 = 0.5. First choose x_n from the uniform distribution U(x | -1, 1), then evaluate f(x_n, w) and add Gaussian noise with standard deviation 0.2 to get the target t_n. Noise precision parameter β = (1/0.2)² = 25. For the prior over w we choose α = 2.0: p(w|α) = N(w | 0, α^{-1} I).
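A sketch of this data-generation step; the sample count N and random seed are arbitrary choices, the generating parameters and noise level follow the slide.

```python
import numpy as np

def generate_line_data(N, w0=-0.3, w1=0.5, noise_sd=0.2, seed=0):
    """Synthetic data from f(x,w) = w0 + w1*x with additive Gaussian noise (std dev 0.2)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=N)                    # x_n drawn from U(-1, 1)
    t = w0 + w1 * x + rng.normal(0.0, noise_sd, size=N)   # noisy targets t_n
    return x, t

x, t = generate_line_data(N=20)
beta = 1.0 / 0.2**2   # noise precision = 25
alpha = 2.0           # prior precision
```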

16 Sampling p(w) and p(w|t). Each sample of w represents a straight line y(x,w) = w_0 + w_1 x in data space. [Figure: contours of the distribution in (w_0, w_1) space and six sampled lines; the top row uses the prior p(w) with no examples, the bottom row uses the posterior p(w|t) after two examples.] Goal of Bayesian Linear Regression: determine p(w|t).

17 Sequential Bayesian Learning. Since there are only two parameters, we can plot the prior and posterior distributions in parameter space and look at the sequential update of the posterior, computing p(w|t) one data point at a time. [Figure: rows show (i) before any data point is observed, (ii) after the first data point (x_1, t_1), (iii) after the second data point, and (iv) after twenty data points. Columns show the likelihood p(t|x,w) for the latest point alone as a function of w, the prior/posterior p(w) or p(w|t), and six sample regression functions y(x,w) with w drawn from the posterior. The likelihood band represents values of (w_0, w_1) corresponding to straight lines passing near the data point; the white cross marks the true parameter value.] With infinitely many points the posterior becomes a delta function centered at the true parameters (the white cross).
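A sketch of the sequential update described above, under the assumed straight-line model and synthetic data from the previous sketch: the posterior after each point acts as the prior for the next, and the result matches the batch posterior computed from all points at once.

```python
import numpy as np

def update(m, S, phi_n, t_n, beta):
    """One sequential Bayesian update: the previous posterior N(m, S) acts as the prior."""
    S_inv_new = np.linalg.inv(S) + beta * np.outer(phi_n, phi_n)
    S_new = np.linalg.inv(S_inv_new)
    m_new = S_new @ (np.linalg.inv(S) @ m + beta * phi_n * t_n)
    return m_new, S_new

alpha, beta = 2.0, 25.0
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, 20)

m, S = np.zeros(2), np.eye(2) / alpha              # start from the prior N(0, alpha^{-1} I)
for x_n, t_n in zip(x, t):
    m, S = update(m, S, np.array([1.0, x_n]), t_n, beta)

# The sequential result matches the batch posterior from all points at once
Phi = np.column_stack([np.ones_like(x), x])
S_batch = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_batch = beta * S_batch @ Phi.T @ t
assert np.allclose(m, m_batch) and np.allclose(S, S_batch)
```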

18 Generalization of the Gaussian prior. The Gaussian prior over parameters is p(w|α) = N(w | 0, α^{-1} I); maximization of the posterior ln p(w|t) is then equivalent to minimization of the sum-of-squares error E(w) = ½ Σ_{n=1}^N { t_n - w^T φ(x_n) }² + (λ/2) w^T w. Other priors yield the Lasso and variations: p(w|α) = [ (q/2) (α/2)^{1/q} / Γ(1/q) ]^M exp( -(α/2) Σ_{j=1}^M |w_j|^q ), where q = 2 corresponds to the Gaussian (and q = 1 to the Lasso). This corresponds to minimization of the regularized error function ½ Σ_{n=1}^N { t_n - w^T φ(x_n) }² + (λ/2) Σ_{j=1}^M |w_j|^q.

19 Predictive Distribution. Usually we are not interested in the value of w itself, but in predicting t for a new value of x: p(t | t, x, X), written p(t|t) leaving out the conditioning variables X and x for convenience. Marginalizing over the parameter variable w is the standard Bayesian approach. By the sum rule of probability, p(t) = ∫ p(t,w) dw = ∫ p(t|w) p(w) dw, and hence p(t|t) = ∫ p(t|w) p(w|t) dw.

20 Predictive Distribution with α, β, x, t. We can predict t for a new value of x using p(t|t) = ∫ p(t|w) p(w|t) dw. With explicit dependence on the prior parameter α, the noise parameter β, and the training-set targets t: p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw. (We have left out the conditioning variables X and x for convenience and applied the sum rule of probability.) The conditional of the target t given the weights w is p(t|x,w,β) = N(t | y(x,w), β^{-1}), and the posterior over w is p(w|t) = N(w | m_N, S_N) with m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ. The RHS is a convolution of two Gaussian distributions, whose result is the Gaussian p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N²(x)), where σ_N²(x) = 1/β + φ(x)^T S_N φ(x).
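A sketch of the predictive mean and variance formulas, assuming `m_N` and `S_N` have already been computed as in the earlier posterior sketch and that `phi` maps an input x to its basis vector.

```python
import numpy as np

def predictive(x_star, phi, m_N, S_N, beta):
    """p(t | x*, t) = N(t | m_N^T phi(x*), sigma^2(x*)),
    with sigma^2(x*) = 1/beta + phi(x*)^T S_N phi(x*)."""
    px = phi(x_star)
    mean = m_N @ px
    var = 1.0 / beta + px @ S_N @ px
    return mean, var

# Example usage with the straight-line basis phi(x) = [1, x]
# (m_N, S_N, beta assumed from the earlier sketches):
# mean, var = predictive(0.5, lambda x: np.array([1.0, x]), m_N, S_N, beta)
```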

21 Variance of Predictive Distribution. The predictive distribution is a Gaussian: p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N²(x)), where σ_N²(x) = 1/β + φ(x)^T S_N φ(x). Since the noise process and the distribution of w are independent Gaussians, their variances are additive: the first term is the noise in the data, and the second is the uncertainty associated with the parameters w, where S_N is the posterior covariance, S_N^{-1} = α I + β Φ^T Φ. Since σ²_{N+1}(x) ≤ σ²_N(x), the distribution becomes narrower as the number of samples increases. As N → ∞, the second term goes to zero and the variance of the predictive distribution arises solely from the additive noise governed by β.

22 Example of Predictive Distribution. Data generated from sin(2πx). Model: nine Gaussian basis functions, y(x,w) = Σ_{j=0}^{8} w_j φ_j(x) = w^T φ(x), with φ_j(x) = exp( -(x - μ_j)² / 2s² ). Predictive distribution: p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N²(x)), where σ_N²(x) = 1/β + φ(x)^T S_N φ(x), m_N = β S_N Φ^T t, S_N^{-1} = α I + β Φ^T Φ, and α and β come from the assumptions p(w|α) = N(w | 0, α^{-1} I) and p(t|x,w,β) = N(t | y(x,w), β^{-1}). [Figure: plot of p(t|x) for a data set of one point, showing the mean of the predictive distribution (red) and one standard deviation around it (pink).]

23 Predictive Distribution Variance. Bayesian prediction: p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N²(x)), where σ_N²(x) = 1/β + φ(x)^T S_N φ(x). We have assumed a Gaussian prior over the parameters, p(w|α) = N(w | 0, α^{-1} I), a Gaussian noise model, p(t|x,w,β) = N(t | y(x,w), β^{-1}), and the design matrix Φ with Φ_{nj} = φ_j(x_n), giving S_N^{-1} = α I + β Φ^T Φ. [Figure: using data from sin(2πx), panels for N = 1, 2, 4, 25 show the mean of the Gaussian predictive distribution and one standard deviation from the mean.] The predictive variance σ_N²(x) is smallest in the neighborhood of the data points, and the uncertainty decreases as more data points are observed. The plot only shows the point-wise predictive variance; to show the covariance between predictions at different values of x, draw samples from the posterior distribution over w, p(w|t), and plot the corresponding functions y(x,w).

24 Plots of the function y(x,w). Draw samples w from the posterior distribution p(w|t) = N(w | m_N, S_N) and plot the corresponding functions y(x,w) = w^T φ(x). This shows the covariance between predictions at different values of x: for a given sampled function and a pair of points x, x', the values y, y' are related through the equivalent kernel k(x,x'), which in turn is determined by the samples. [Figure: sampled functions for N = 1, 2, 4, 25.]
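A sketch of drawing functions from the posterior; it assumes `m_N`, `S_N`, and a basis map `phi` as in the earlier sketches, and returns the sampled curves evaluated on a grid rather than plotting them.

```python
import numpy as np

def sample_functions(phi, m_N, S_N, x_grid, n_samples=6, seed=0):
    """Draw w ~ N(m_N, S_N) and evaluate y(x, w) = w^T phi(x) on a grid of x values."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(m_N, S_N, size=n_samples)   # shape (n_samples, M)
    Phi_grid = np.stack([phi(x) for x in x_grid])            # shape (len(x_grid), M)
    return W @ Phi_grid.T                                    # shape (n_samples, len(x_grid))
```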

25 Disadvantage of a Local Basis. Predictive distribution, assuming the Gaussian prior p(w|α) = N(w | 0, α^{-1} I) and Gaussian noise t = y(x,w) + ε with p(t|x,w,β) = N(t | y(x,w), β^{-1}): p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N²(x)), where σ_N²(x) = 1/β + φ(x)^T S_N φ(x) and S_N^{-1} = α I + β Φ^T Φ. With localized basis functions (e.g., Gaussians), in regions away from the basis function centers the contribution of the second term of the variance σ_N²(x) goes to zero, leaving only the noise contribution 1/β. The model thus becomes very confident outside the region occupied by the basis functions. This problem is avoided by the alternative Bayesian approach of Gaussian Processes.

26 Dealing with unknown β. If both w and β are treated as unknown, we can introduce a conjugate prior distribution p(w,β) given by a Gaussian-gamma distribution, of the form p(μ,λ) = N(μ | μ_0, (βλ)^{-1}) Gam(λ | a, b). In this case the predictive distribution is a Student's t-distribution: St(x | μ, λ, ν) = [Γ(ν/2 + 1/2) / Γ(ν/2)] (λ / πν)^{1/2} [1 + λ(x - μ)²/ν]^{-ν/2 - 1/2}.

27 Mean of p(w|t) has a Kernel Interpretation. The regression function is y(x,w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x). If we take a Bayesian approach with Gaussian prior p(w) = N(w | m_0, S_0), then the posterior is p(w|t) = N(w | m_N, S_N), where m_N = S_N (S_0^{-1} m_0 + β Φ^T t) and S_N^{-1} = S_0^{-1} + β Φ^T Φ. With the zero-mean isotropic prior p(w|α) = N(w | 0, α^{-1} I), m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ. The posterior mean β S_N Φ^T t has a kernel interpretation; this sets the stage for kernel methods and Gaussian processes.

28 Equivalent Kernel. The posterior mean of w (for a zero-mean prior) is m_N = β S_N Φ^T t, where S_N^{-1} = S_0^{-1} + β Φ^T Φ, S_0 is the covariance matrix of the prior p(w), β is the noise precision, and Φ is the design matrix, which depends on the samples. Substituting this mean value into the regression function y(x,w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x), the mean of the predictive distribution at a point x is y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^N β φ(x)^T S_N φ(x_n) t_n = Σ_{n=1}^N k(x, x_n) t_n, where k(x,x') = β φ(x)^T S_N φ(x') is the equivalent kernel. Thus the mean of the predictive distribution is a linear combination of the training set target variables t_n. Note: the equivalent kernel depends on the input values x_n from the dataset because they appear in S_N.
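A sketch of the equivalent kernel; it assumes the zero-mean isotropic prior (so S_N^{-1} = αI + βΦ^TΦ) and a basis map `phi` and design matrix `Phi` as in the earlier sketches.

```python
import numpy as np

def equivalent_kernel(x, x_prime, phi, Phi, alpha, beta):
    """k(x, x') = beta * phi(x)^T S_N phi(x'), with S_N^{-1} = alpha*I + beta*Phi^T Phi."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    return beta * phi(x) @ S_N @ phi(x_prime)

# The predictive mean is then a linear smoother: y(x, m_N) = sum_n k(x, x_n) * t_n
```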

29 Kernel Function. Regression functions such as y(x, m_N) = Σ_{n=1}^N k(x, x_n) t_n, with k(x,x') = β φ(x)^T S_N φ(x') and S_N^{-1} = S_0^{-1} + β Φ^T Φ, that take a linear combination of the training set target values are known as linear smoothers. They depend on the input values x_n from the data set since these appear in the definition of S_N.

30 Example of the kernel for a Gaussian Basis. Equivalent kernel k(x,x') = β φ(x)^T S_N φ(x'), with S_N^{-1} = S_0^{-1} + β Φ^T Φ and Gaussian basis φ_j(x) = exp( -(x - μ_j)² / 2s² ). [Figure: plot of k(x,x') as a function of x and x', together with slices through it for three values of x.] The kernel peaks when x' = x, i.e., it is localized around x. The kernel is used directly in regression: the mean of the predictive distribution is y(x, m_N) = Σ_{n=1}^N k(x, x_n) t_n, obtained by forming a weighted combination of the target values in which data points close to x are given higher weight than points further removed from x. The data set used to generate the kernel was 200 values of x equally spaced in (0,1).

31 Equivalent Kernel for a Polynomial Basis. Basis function φ_j(x) = x^j, with k(x,x') = β φ(x)^T S_N φ(x') and S_N^{-1} = S_0^{-1} + β Φ^T Φ. Data points close to x are given higher weight than points further removed from x. [Figure: k(x,x') plotted as a function of x' for x = 0.] The kernel is a localized function of x' even though the corresponding basis function is nonlocal.

32 Equivalent Kernel for a Sigmoidal Basis. Basis function φ_j(x) = σ( (x - μ_j)/s ), where σ(a) = 1/(1 + exp(-a)), with k(x,x') = β φ(x)^T S_N φ(x'). The kernel is a localized function of x' even though the corresponding basis function is nonlocal.

33 Covariance between y(x) and y(x'). An important insight: the value of the kernel function between two points is directly related to the covariance between their predicted function values: cov[y(x), y(x')] = cov[φ(x)^T w, w^T φ(x')] = φ(x)^T S_N φ(x') = β^{-1} k(x, x'), where we have used p(w|t) = N(w | m_N, S_N) and k(x,x') = β φ(x)^T S_N φ(x'). From the form of the equivalent kernel, the predictive means at nearby points y(x), y(x') will be highly correlated, whereas for more distant pairs the correlation is smaller. The kernel captures this covariance.
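A Monte Carlo check of this identity under the assumed straight-line setup from the earlier sketches: sampling w from the posterior and measuring the empirical covariance of y(x), y(x') should roughly recover β^{-1} k(x, x') = φ(x)^T S_N φ(x').

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
x_data = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x_data + rng.normal(0.0, 0.2, 20)
Phi = np.column_stack([np.ones_like(x_data), x_data])

S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

phi = lambda x: np.array([1.0, x])
x1, x2 = 0.2, 0.7
W = rng.multivariate_normal(m_N, S_N, size=200_000)   # samples from p(w|t)
y1, y2 = W @ phi(x1), W @ phi(x2)

empirical = np.cov(y1, y2)[0, 1]
analytic = phi(x1) @ S_N @ phi(x2)                     # = (1/beta) * k(x1, x2)
print(empirical, analytic)                              # should agree closely
```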

34 Predictive plot vs. Posterior plots. The predictive distribution allows us to visualize the pointwise uncertainty in the predictions, governed by p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N²(x)) with σ_N²(x) = 1/β + φ(x)^T S_N φ(x). By drawing samples from the posterior p(w|t) and plotting the corresponding functions y(x,w), we instead visualize the joint uncertainty in the posterior distribution between the y values at two or more x values, as governed by the equivalent kernel.

35 Directly Specifying the Kernel Function. The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression: instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, directly define a kernel function and use it to make predictions for a new input x, given the observation set. This leads to a practical framework for regression (and classification) called Gaussian Processes.

36 Summing Kernel Values Over Samples. The effective kernel defines the weights by which the target values are combined to make a prediction at x. It can be shown that these weights sum to one: Σ_{n=1}^N k(x, x_n) = 1 for all values of x. This result can be proven intuitively: since y(x, m_N) = Σ_{n=1}^N k(x, x_n) t_n, with k(x,x') = β φ(x)^T S_N φ(x') and S_N^{-1} = S_0^{-1} + β Φ^T Φ, the summation is equivalent to considering the predictive mean ŷ(x) for a set of target data in which t_n = 1 for all n. Provided the basis functions are linearly independent, N > M, and one of the basis functions is constant (corresponding to the bias parameter), we can fit such training data exactly, and hence ŷ(x) = 1.
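A numerical check of the summation property, under the assumed Gaussian-basis setup from the earlier sketches (a constant bias basis function, linearly independent basis functions, N > M); with a finite prior precision α the sum is close to, rather than exactly, one.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
x_data = rng.uniform(0.0, 1.0, 50)
centers, s = np.linspace(0, 1, 9), 0.15

def phi(x):
    # Constant (bias) basis function plus Gaussian RBFs
    return np.concatenate([[1.0], np.exp(-0.5 * ((x - centers) / s) ** 2)])

Phi = np.stack([phi(x) for x in x_data])
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)

x_star = 0.37
k = beta * (phi(x_star) @ S_N @ Phi.T)   # k(x*, x_n) for every training point
print(k.sum())                            # close to 1 (approaches 1 as alpha -> 0)
```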

37 Kernel Function Properties. The equivalent kernel can be positive or negative: although it satisfies a summation constraint, the corresponding predictions are not necessarily a convex combination of the training set target variables. The equivalent kernel also satisfies an important property shared by kernel functions in general: it can be expressed as an inner product with respect to a vector ψ(x) of nonlinear functions, k(x,z) = ψ(x)^T ψ(z), where for k(x,x') = β φ(x)^T S_N φ(x') we have ψ(x) = β^{1/2} S_N^{1/2} φ(x).
