Bayesian Linear Regression. Sargur Srihari
1 Bayesian Linear Regression. Sargur Srihari
2 Topics in Bayesian Regression: recall of maximum likelihood linear regression; the parameter (posterior) distribution; the predictive distribution; the equivalent kernel.
3 Linear Regression: Model Complexity M
Polynomial regression: y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j
Red lines are best fits with M = 0, 1, 3, 9 and N = 10 data points: M = 0 and M = 1 are poor representations of sin(2πx), M = 3 is the best fit to sin(2πx), and M = 9 overfits (also a poor representation of sin(2πx)).
4 Max Likelihood Regression
Input vector x, basis functions {φ_1(x), .., φ_M(x)}:
y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
Radial basis functions: φ_j(x) = exp{ -(1/2)(x - μ_j)^T Σ^{-1} (x - μ_j) }
Max likelihood objective with N examples {x_1, .., x_N} (equivalent to the mean-squared-error objective):
E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2
Regularized MSE (λ is the regularization coefficient):
E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w
Closed-form ML solution: w_ML = (Φ^T Φ)^{-1} Φ^T t, where Φ is the N×M design matrix with entries Φ_{nj} = φ_j(x_n) and (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudoinverse of Φ.
Regularized solution: w = (λI + Φ^T Φ)^{-1} Φ^T t
Gradient descent: w^{(τ+1)} = w^{(τ)} - η ∇E, with
∇E = - Σ_{n=1}^{N} { t_n - w^{(τ)T} φ(x_n) } φ(x_n)
Regularized version: ∇E = - Σ_{n=1}^{N} { t_n - w^{(τ)T} φ(x_n) } φ(x_n) + λ w^{(τ)}
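A minimal NumPy sketch of the two closed-form solutions above, using a Gaussian radial basis. The data size, noise level, basis centres, and width are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of sin(2*pi*x) (sizes and noise are assumptions)
N = 25
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

# Gaussian radial basis functions on a grid of centres (a modelling choice)
centres = np.linspace(0.0, 1.0, 9)
s = 0.15  # common basis width

def design_matrix(x):
    # Phi[n, j] = phi_j(x_n) = exp(-(x_n - mu_j)^2 / (2 s^2))
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

Phi = design_matrix(x)

# Closed-form ML solution w_ML = (Phi^T Phi)^{-1} Phi^T t, via the pseudoinverse
w_ml = np.linalg.pinv(Phi) @ t

# Regularized solution w = (lambda I + Phi^T Phi)^{-1} Phi^T t
lam = 1e-3
w_reg = np.linalg.solve(lam * np.eye(9) + Phi.T @ Phi, Phi.T @ t)
```

The pseudoinverse route is numerically safer than forming (Φ^T Φ)^{-1} explicitly when the design matrix is ill-conditioned.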
5 Shortcomings of MLE
The MLE of the parameters w does not address model complexity M (how many basis functions?). Complexity is controlled by data size: more data allows a better fit without overfitting.
Regularization also controls overfit (λ controls the effect):
E(w) = E_D(w) + λ E_W(w), where E_D(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 and E_W(w) = (1/2) w^T w
But M and the choice of the φ_j are still important. M can be determined by holdout, but that is wasteful of data.
Model complexity and over-fitting are better handled using the Bayesian approach.
6 Bayesian Linear Regression
Using Bayes rule, the posterior is proportional to likelihood times prior:
p(w|t) = p(t|w) p(w) / p(t)
where p(t|w) is the likelihood of the observed data and p(w) is the prior distribution over the parameters.
We will look at: a normal distribution for the prior p(w); a likelihood p(t|w) that is a product of Gaussians based on the noise model; and conclude that the posterior is also Gaussian.
7 Gaussian Prior over Parameters
Assume a multivariate Gaussian prior for w (which has components w_0, .., w_M):
p(w) = N(w | m_0, S_0), with mean m_0 and covariance matrix S_0
If we choose S_0 = α^{-1} I, the variances of the weights are all equal to α^{-1} and the covariances are zero: p(w) has zero mean (m_0 = 0) and is isotropic over the weights (same variance in every direction of weight space).
8 Likelihood of Data is Gaussian
Assume a noise precision parameter β: t = y(x, w) + ε, where ε is Gaussian noise, so that
p(t | x, w, β) = N(t | y(x, w), β^{-1})
Note that the output t is a scalar. The likelihood of t = {t_1, .., t_N} is then
p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1})
This is the probability of the target data t given the parameters w and inputs X = {x_1, .., x_N}. Due to the Gaussian noise, the likelihood p(t|w) is also Gaussian.
9 Posterior Distribution is also Gaussian
Prior: p(w) = N(w | m_0, S_0), i.e., it is Gaussian. The likelihood comes from Gaussian noise: p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}). It follows that the posterior p(w|t) is also Gaussian.
Proof: use a standard result for Gaussians. If the marginal p(w) and the conditional p(t|w) have Gaussian forms, then the marginal p(t) and the conditional p(w|t) are also Gaussian:
Let p(w) = N(w | μ, Λ^{-1}) and p(t|w) = N(t | Aw + b, L^{-1})
Then the marginal is p(t) = N(t | Aμ + b, L^{-1} + A Λ^{-1} A^T)
and the conditional is p(w|t) = N(w | Σ{A^T L (t - b) + Λμ}, Σ), where Σ = (Λ + A^T L A)^{-1}
10 Exact Form of the Posterior Distribution
We have p(w) = N(w | m_0, S_0) and p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}).
The posterior is also Gaussian and can be written directly as p(w|t) = N(w | m_N, S_N), where the mean and covariance of the posterior are given by
m_N = S_N (S_0^{-1} m_0 + β Φ^T t)
S_N^{-1} = S_0^{-1} + β Φ^T Φ
and Φ is the N×M design matrix with rows [φ_0(x_n), .., φ_{M-1}(x_n)].
Example: prior p(w|α) = N(w | 0, α^{-1} I) and posterior in weight space (w_0, w_1) for scalar input x and y(x, w) = w_0 + w_1 x.
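A minimal sketch of the posterior computation above for the straight-line model with the zero-mean isotropic prior, so m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ. The data, α, and β are illustrative assumptions in the spirit of the slides' running example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed straight-line data; alpha, beta chosen to match the later example
alpha, beta = 2.0, 25.0
w_true = np.array([-0.3, 0.5])

x = rng.uniform(-1.0, 1.0, 20)
t = w_true[0] + w_true[1] * x + rng.normal(0.0, 0.2, x.size)

# Design matrix for phi(x) = (1, x)
Phi = np.column_stack([np.ones_like(x), x])

# Zero-mean isotropic prior (m0 = 0, S0 = alpha^{-1} I), so the general
# formulas reduce to S_N^{-1} = alpha I + beta Phi^T Phi, m_N = beta S_N Phi^T t
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t
```

With 20 points the posterior mean should already sit much closer to the true weights than the prior mean at the origin.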
11 Properties of the Posterior
1. Since the posterior p(w|t) = N(w | m_N, S_N) is Gaussian, its mode coincides with its mean. Thus the maximum-posterior weight vector is w_MAP = m_N.
2. For an infinitely broad prior S_0 = α^{-1} I with precision α → 0, the mean m_N reduces to the maximum likelihood solution w_ML = (Φ^T Φ)^{-1} Φ^T t.
3. If N = 0, the posterior reverts to the prior.
4. If data points arrive sequentially, the posterior at any stage acts as the prior distribution for the subsequent data points.
12 Choose a Simple Gaussian Prior p(w)
Model: y(x, w) = w_0 + w_1 x. Choose a zero-mean (m_0 = 0) isotropic (equal-variance) Gaussian governed by a single precision parameter α:
p(w|α) = N(w | 0, α^{-1} I)
The corresponding posterior distribution is p(w|t) = N(w | m_N, S_N), where
m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ
Note: β is the noise precision and α is the precision (inverse variance) of the parameters w in the prior. With infinitely many samples the posterior collapses to a point estimate.
13 Equivalence to MLE with Regularization
Since p(t | X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{-1}) and p(w|α) = N(w | 0, α^{-1} I), the log of the posterior is
ln p(w|t) = -(β/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 - (α/2) w^T w + const
Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error
E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w
with the addition of the quadratic regularization term w^T w, with λ = α/β.
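This equivalence can be checked numerically: the Bayesian posterior mean under the isotropic prior coincides exactly with the regularized least-squares solution for λ = α/β. The data and the cubic polynomial basis below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0

# Assumed data and a small polynomial basis (choices not from the slides)
x = rng.uniform(-1.0, 1.0, 15)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)
Phi = np.column_stack([x ** j for j in range(4)])

# Posterior mean with prior N(0, alpha^{-1} I):
# m_N = beta (alpha I + beta Phi^T Phi)^{-1} Phi^T t
m_N = beta * np.linalg.solve(alpha * np.eye(4) + beta * Phi.T @ Phi, Phi.T @ t)

# Regularized least squares with lambda = alpha / beta:
# w = (lambda I + Phi^T Phi)^{-1} Phi^T t
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(4) + Phi.T @ Phi, Phi.T @ t)

# The two solutions coincide (factor beta out of the inverse to see why)
```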
14 Bayesian Linear Regression Example (Straight-Line Fit)
Single input variable x, single target variable t. The goal is to fit the linear model y(x, w) = w_0 + w_1 x, i.e., to recover w = [w_0, w_1] given the samples (x, t).
15 Data Generation
Synthetic data are generated from f(x, w) = w_0 + w_1 x with parameter values w_0 = -0.3 and w_1 = 0.5.
First choose x_n from U(-1, 1), then evaluate f(x_n, w) and add Gaussian noise with standard deviation 0.2 to get the target t_n.
Noise precision parameter: β = (1/0.2)^2 = 25.
For the prior over w we choose α = 2: p(w|α) = N(w | 0, α^{-1} I).
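The data-generation recipe above can be sketched directly; only the sample count N is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

# Generate synthetic data as described on the slide
w0, w1 = -0.3, 0.5
noise_std = 0.2
beta = 1.0 / noise_std ** 2   # noise precision, (1/0.2)^2 = 25

N = 20                        # sample count is an assumption
x = rng.uniform(-1.0, 1.0, N)              # x_n drawn from U(-1, 1)
t = w0 + w1 * x + rng.normal(0.0, noise_std, N)

alpha = 2.0                   # prior precision for p(w | alpha)
```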
16 Sampling p(w) and p(w|t)
Each sample of w represents a straight line y(x, w) = w_0 + w_1 x in data space. With no examples, six samples drawn from the prior p(w) give a wide spread of lines; with two examples, six samples drawn from the posterior p(w|t) are constrained to pass near the observed data points.
Goal of Bayesian linear regression: determine p(w|t).
17 Sequential Bayesian Learning
Since there are only two parameters, we can plot the prior and posterior distributions in parameter space and examine the sequential update of the posterior, where p(w|t) is updated one data point at a time.
Before any data points are observed: the plot shows the prior p(w) and six sample regression functions y(x, w) with w drawn from it.
After the first data point (x_1, t_1) is observed: the likelihood p(t|x, w) for that point alone, viewed as a function of w, is a band representing values of (w_0, w_1) whose straight lines pass near the data point. Multiplying the prior by this likelihood gives the posterior p(w|t), and six samples from it correspond to lines passing near the point.
After the second data point, and again after twenty data points: the same update repeats, with the likelihood for the newest point alone multiplying the current posterior (acting as prior). The true parameter value is marked with a white cross.
With infinitely many points, the posterior becomes a delta function centered at the true parameters.
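The sequential update can be sketched as follows: starting from the prior, each observation adds β φ φ^T to the precision matrix, and the posterior after n points serves as the prior for point n + 1. The final result matches the batch posterior exactly. Data and parameter choices follow the running straight-line example:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0
w_true = np.array([-0.3, 0.5])

# Start from the prior N(0, alpha^{-1} I); the posterior after each point
# becomes the prior for the next point.
m = np.zeros(2)
S_inv = alpha * np.eye(2)

xs, ts = [], []
for _ in range(20):
    x_n = rng.uniform(-1.0, 1.0)
    t_n = w_true[0] + w_true[1] * x_n + rng.normal(0.0, 0.2)
    xs.append(x_n)
    ts.append(t_n)

    phi = np.array([1.0, x_n])
    # One-point case of m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)
    S_inv_new = S_inv + beta * np.outer(phi, phi)
    m = np.linalg.solve(S_inv_new, S_inv @ m + beta * phi * t_n)
    S_inv = S_inv_new
```

By induction, the accumulated precision and mean equal the batch quantities α I + β Φ^T Φ and β S_N Φ^T t computed from all twenty points at once.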
18 Generalization of the Gaussian Prior
With the Gaussian prior over parameters p(w|α) = N(w | 0, α^{-1} I), maximization of the posterior ln p(w|t) is equivalent to minimization of the sum-of-squares error
E(w) = (1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w
A more general prior yields the lasso and its variations:
p(w|α) = [ (q/2) (α/2)^{1/q} (1/Γ(1/q)) ]^M exp( -(α/2) Σ_{j=1}^{M} |w_j|^q )
q = 2 corresponds to the Gaussian. This corresponds to minimization of the regularized error function
(1/2) Σ_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) Σ_{j=1}^{M} |w_j|^q
19 Predictive Distribution
Usually we are not interested in the value of w itself but in predicting t for a new value of x, i.e., p(t | t, x, X), written p(t|t), leaving out the conditioning variables X and x for convenience.
Marginalizing over the parameter variable w is the standard Bayesian approach. By the sum rule of probability,
p(t) = ∫ p(t, w) dw = ∫ p(t|w) p(w) dw
so we can write
p(t|t) = ∫ p(t|w) p(w|t) dw
20 Predictive Distribution with α, β, X, t
We can predict t for a new value of x using p(t|t) = ∫ p(t|w) p(w|t) dw. With explicit dependence on the prior parameter α, the noise parameter β, and the training targets t:
p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw
We have left out the conditioning variables X and x for convenience, and applied the sum rule p(t) = ∫ p(t|w) p(w) dw.
The conditional of the target t given weights w is p(t | x, w, β) = N(t | y(x, w), β^{-1}), and the posterior over w is p(w|t) = N(w | m_N, S_N), where m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ.
The RHS is a convolution of two Gaussian distributions, whose result is the Gaussian
p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
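The closed-form predictive mean and variance above can be sketched directly. The data are an assumption following the running straight-line example:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0

# Assumed straight-line data
x = rng.uniform(-1.0, 1.0, 30)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, x.size)
Phi = np.column_stack([np.ones_like(x), x])

S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def predict(x_new):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi^T S_N phi."""
    phi = np.array([1.0, x_new])
    mean = m_N @ phi
    var = 1.0 / beta + phi @ S_N @ phi   # noise term + parameter uncertainty
    return mean, var

mean, var = predict(0.5)
```

Because S_N is positive definite, the predictive variance always exceeds the irreducible noise floor 1/β.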
21 Variance of the Predictive Distribution
The predictive distribution is a Gaussian:
p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
Since the noise process and the distribution of w are independent Gaussians, their variances are additive: the first term is the noise in the data, and the second term is the uncertainty associated with the parameters w, where S_N is the posterior covariance with S_N^{-1} = α I + β Φ^T Φ.
Since σ_{N+1}^2(x) ≤ σ_N^2(x), the distribution becomes narrower as the number of samples increases. As N → ∞, the second term of the variance goes to zero and the variance of the predictive distribution arises solely from the additive noise parameter β.
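The monotone shrinkage σ_{N+1}^2(x) ≤ σ_N^2(x) can be checked numerically: adding one observation adds β φ φ^T to the posterior precision, which can only reduce φ(x)^T S_N φ(x) at any query point. Inputs and the query point below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0

x = rng.uniform(-1.0, 1.0, 10)
Phi = np.column_stack([np.ones_like(x), x])

def pred_var(Phi, x_star=0.3):
    # sigma_N^2(x*) = 1/beta + phi(x*)^T S_N phi(x*)
    S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
    phi_star = np.array([1.0, x_star])
    return 1.0 / beta + phi_star @ S_N @ phi_star

v_before = pred_var(Phi)
Phi_plus = np.vstack([Phi, [1.0, 0.7]])   # observe one more point, at x = 0.7
v_after = pred_var(Phi_plus)
# v_after <= v_before: each new observation can only shrink the variance
```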
22 Example of Predictive Distribution
Data generated from sin(2πx). Model: nine Gaussian basis functions,
y(x, w) = Σ_{j=0}^{8} w_j φ_j(x) = w^T φ(x), with φ_j(x) = exp( -(x - μ_j)^2 / (2 s^2) )
Predictive distribution: p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x), m_N = β S_N Φ^T t, S_N^{-1} = α I + β Φ^T Φ, and α and β come from the assumptions p(w|α) = N(w | 0, α^{-1} I) and p(t | x, w, β) = N(t | y(x, w), β^{-1}).
The plot of p(t|x) for one data point shows the mean of the predictive distribution (red) and one standard deviation around it (pink).
23 Predictive Distribution Variance
Bayesian prediction: p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x), assuming a Gaussian prior over parameters p(w|α) = N(w | 0, α^{-1} I), a Gaussian noise model p(t | x, w, β) = N(t | y(x, w), β^{-1}), and S_N^{-1} = α I + β Φ^T Φ with Φ the design matrix.
Using data from sin(2πx), plots for N = 1, 2, 4 and 25 points show the mean of the Gaussian predictive distribution and one standard deviation from the mean. σ_N^2(x), the variance of t, is smallest in the neighborhood of the data points, and the uncertainty decreases as more data points are observed.
The plots only show the point-wise predictive variance. To show the covariance between predictions at different values of x, draw samples from the posterior distribution over w, p(w|t), and plot the corresponding functions y(x, w).
24 Plots of the Function y(x, w)
Draw samples w from the posterior distribution p(w|t) = N(w | m_N, S_N) and plot the corresponding functions y(x, w) = w^T φ(x), for N = 1, 2, 4 and 25 data points.
This shows the covariance between predictions at different values of x: for a given function and a pair of inputs x, x', the relationship between the values y, y' is determined by the equivalent kernel k(x, x'), which in turn is determined by the samples.
25 Disadvantage of a Local Basis
The predictive distribution, assuming a Gaussian prior p(w|α) = N(w | 0, α^{-1} I) and Gaussian noise t = y(x, w) + ε with p(t | x, w, β) = N(t | y(x, w), β^{-1}), is
p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x) and S_N^{-1} = α I + β Φ^T Φ
With localized basis functions, e.g., Gaussians, in regions away from the basis function centers the contribution of the second term of the variance σ_N^2(x) goes to zero, leaving only the noise contribution 1/β.
The model thus becomes very confident outside the region occupied by the basis functions. This problem is avoided by the alternative Bayesian approach of Gaussian processes.
26 Dealing with Unknown β
If both w and β are treated as unknown, we can introduce a conjugate prior distribution p(w, β), given by a Gaussian-gamma distribution; in the univariate case
p(μ, λ) = N(μ | μ_0, (βλ)^{-1}) Gam(λ | a, b)
In this case the predictive distribution is a Student's t-distribution:
St(x | μ, λ, ν) = [ Γ(ν/2 + 1/2) / Γ(ν/2) ] (λ / (πν))^{1/2} [ 1 + λ(x - μ)^2 / ν ]^{-ν/2 - 1/2}
27 Mean of p(w|t) has a Kernel Interpretation
The regression function is y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x).
If we take a Bayesian approach with Gaussian prior p(w) = N(w | m_0, S_0), then the posterior is p(w|t) = N(w | m_N, S_N), where
m_N = S_N (S_0^{-1} m_0 + β Φ^T t) and S_N^{-1} = S_0^{-1} + β Φ^T Φ
With the zero-mean isotropic prior p(w|α) = N(w | 0, α^{-1} I):
m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ
The posterior mean β S_N Φ^T t has a kernel interpretation, which sets the stage for kernel methods and Gaussian processes.
28 Equivalent Kernel
The posterior mean of w is m_N = β S_N Φ^T t, where S_N^{-1} = S_0^{-1} + β Φ^T Φ, S_0 is the covariance matrix of the prior p(w), β is the noise precision, and Φ is the design matrix, which depends on the samples.
Substitute this mean value into the regression function y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x). The mean of the predictive distribution at a point x is
y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n = Σ_{n=1}^{N} k(x, x_n) t_n
where k(x, x') = β φ(x)^T S_N φ(x') is the equivalent kernel.
Thus the mean of the predictive distribution is a linear combination of the training-set target variables t_n. Note that the equivalent kernel depends on the input values x_n from the dataset because they appear in S_N.
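The chain of equalities above can be verified numerically: predicting via m_N^T φ(x) and via the kernel-weighted sum Σ_n k(x, x_n) t_n gives the same value. The data, basis centres, and width below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta = 2.0, 25.0

x = rng.uniform(-1.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

centres = np.linspace(-1.0, 1.0, 9)   # basis centres and width are assumptions
def phi(u):
    u = np.atleast_1d(u)
    return np.exp(-(u[:, None] - centres[None, :]) ** 2 / (2 * 0.2 ** 2))

Phi = phi(x)
S_N = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def k(xa, xb):
    # Equivalent kernel k(x, x') = beta phi(x)^T S_N phi(x')
    return beta * (phi(xa) @ S_N @ phi(xb).T)

x_star = 0.1
y_direct = (phi(x_star) @ m_N)[0]   # m_N^T phi(x*)
y_kernel = (k(x_star, x) @ t)[0]    # sum_n k(x*, x_n) t_n
```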
29 Kernel Function
Regression functions of the form y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n, with k(x, x') = β φ(x)^T S_N φ(x') and S_N^{-1} = S_0^{-1} + β Φ^T Φ, that take a linear combination of the training-set target values are known as linear smoothers.
They depend on the input values x_n from the data set, since these appear in the definition of S_N.
30 Example Kernel for a Gaussian Basis
Equivalent kernel k(x, x') = β φ(x)^T S_N φ(x'), with S_N^{-1} = S_0^{-1} + β Φ^T Φ and Gaussian basis φ_j(x) = exp( -(x - μ_j)^2 / (2 s^2) ).
A plot of k(x, x') as a function of x and x' peaks when x = x'; for three values of x the behavior of k(x, x') is shown as a slice. The kernels are localized around x', i.e., each peaks when x' = x. The data set used to generate the kernel was 200 values of x equally spaced in (-1, 1).
The kernel is used directly in regression: the mean of the predictive distribution is y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n, obtained by forming a weighted combination of target values in which data points close to x are given higher weight than points further removed from x.
31 Equivalent Kernel for a Polynomial Basis
Basis function φ_j(x) = x^j, with k(x, x') = β φ(x)^T S_N φ(x') and S_N^{-1} = S_0^{-1} + β Φ^T Φ.
Plotted as a function of x' for x = 0, the kernel is a localized function of x' even though the corresponding basis function is nonlocal: data points close to x are given higher weight than points further removed from x.
32 Equivalent Kernel for a Sigmoidal Basis
Basis function φ_j(x) = σ( (x - μ_j) / s ), where σ(a) = 1 / (1 + exp(-a)), with k(x, x') = β φ(x)^T S_N φ(x').
Again the kernel is a localized function of x' even though the corresponding basis function is nonlocal.
33 Covariance between y(x) and y(x')
An important insight: the value of the kernel function between two points is directly related to the covariance between their predictions. Using p(w|t) = N(w | m_N, S_N) and k(x, x') = β φ(x)^T S_N φ(x'):
cov[y(x), y(x')] = cov[φ(x)^T w, w^T φ(x')] = φ(x)^T S_N φ(x') = β^{-1} k(x, x')
From the form of the equivalent kernel, the predictive means at nearby points y(x), y(x') will be highly correlated; for more distant pairs the correlation is smaller. The kernel captures the covariance.
34 Predictive Plot vs. Posterior Plots
The predictive distribution allows us to visualize the pointwise uncertainty in the predictions, governed by p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x).
Drawing samples from the posterior p(w|t) and plotting the corresponding functions y(x, w) visualizes the joint uncertainty in the posterior distribution between the y values at two or more x values, as governed by the equivalent kernel.
35 Directly Specifying the Kernel Function
The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression: instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, directly define a kernel function and use it to make predictions for a new input x given the observed training set.
This leads to a practical framework for regression (and classification) called Gaussian Processes.
36 Summing Kernel Values over the Samples
The effective kernel defines the weights by which the training-set target values are combined to make a prediction at x. It can be shown that the weights sum to one, i.e., for all values of x,
Σ_{n=1}^{N} k(x, x_n) = 1
This result can be proven intuitively: since y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n, the summation is equivalent to considering the predictive mean ŷ(x) for a set of target data in which t_n = 1 for all n. Provided the basis functions are linearly independent, N > M, and one of the basis functions is constant (corresponding to the bias parameter), we can fit the training data exactly, and hence ŷ(x) = 1.
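A numerical sketch of the sum-to-one property. The exact-fit argument applies in the limit of a flat prior, so a very small α is assumed here (with α = 2 the prior shrinkage would leave the sum slightly below one); the polynomial basis includes a constant function and N = 20 > M = 4:

```python
import numpy as np

rng = np.random.default_rng(8)
beta = 25.0
alpha = 1e-8   # near-flat prior, so the exact-fit argument applies

x = rng.uniform(-1.0, 1.0, 20)                      # N = 20 > M = 4
Phi = np.column_stack([x ** j for j in range(4)])   # includes a constant basis fn

S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)

def kernel_weights(x_star):
    # k(x*, x_n) for n = 1..N, with k(x, x') = beta phi(x)^T S_N phi(x')
    phi_star = x_star ** np.arange(4)
    return beta * phi_star @ S_N @ Phi.T

w = kernel_weights(0.3)
# w.sum() is (essentially) 1 at any query point
```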
37 Kernel Function Properties
The equivalent kernel can be positive or negative: although it satisfies a summation constraint, the corresponding predictions are not necessarily a convex combination of the training-set target variables.
The equivalent kernel satisfies an important property shared by kernel functions in general: it can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions,
k(x, z) = ψ(x)^T ψ(z), where ψ(x) = β^{1/2} S_N^{1/2} φ(x)
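The inner-product form can be checked numerically. Any factor L with L L^T = S_N yields a valid feature map ψ(x) = β^{1/2} L^T φ(x); the sketch below uses the Cholesky factor rather than the symmetric square root S_N^{1/2}, and the data and quadratic basis are assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, beta = 2.0, 25.0

x = rng.uniform(-1.0, 1.0, 15)
Phi = np.column_stack([np.ones_like(x), x, x ** 2])

S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)

# Cholesky factor L with L L^T = S_N (one valid "square root" of S_N)
L = np.linalg.cholesky(S_N)

def psi(phi_vec):
    # psi(x) = sqrt(beta) L^T phi(x)
    return np.sqrt(beta) * L.T @ phi_vec

def k(phi_a, phi_b):
    # k(x, z) = beta phi(x)^T S_N phi(z)
    return beta * phi_a @ S_N @ phi_b

phi_a = np.array([1.0, 0.2, 0.2 ** 2])      # phi evaluated at x = 0.2
phi_b = np.array([1.0, -0.5, (-0.5) ** 2])  # phi evaluated at z = -0.5
# k(x, z) == psi(x)^T psi(z)
```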
These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationIntroduction to Probabilistic Graphical Models
Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in
More informationIntroduction to SVM and RVM
Introduction to SVM and RVM Machine Learning Seminar HUS HVL UIB Yushu Li, UIB Overview Support vector machine SVM First introduced by Vapnik, et al. 1992 Several literature and wide applications Relevance
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationComputer Vision Group Prof. Daniel Cremers. 3. Regression
Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationCh 4. Linear Models for Classification
Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,
More informationCSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes
CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More informationLecture 5: GPs and Streaming regression
Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X
More informationRelevance Vector Machines
LUT February 21, 2011 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise Support Vector Machines The relevance vector machine (RVM) is a bayesian
More informationVariational Bayesian Logistic Regression
Variational Bayesian Logistic Regression Sargur N. University at Buffalo, State University of New York USA Topics in Linear Models for Classification Overview 1. Discriminant Functions 2. Probabilistic
More informationDEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER MACHINE LEARNING AND ADAPTIVE INTELLIGENCE
Data Provided: None DEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER 204 205 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE hour Please note that the rubric of this paper is made different from many other papers.
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationModeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop
Modeling Data with Linear Combinations of Basis Functions Read Chapter 3 in the text by Bishop A Type of Supervised Learning Problem We want to model data (x 1, t 1 ),..., (x N, t N ), where x i is a vector
More informationLinear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging
More informationGWAS IV: Bayesian linear (variance component) models
GWAS IV: Bayesian linear (variance component) models Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS IV: Bayesian
More informationEE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015
EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 3 Stochastic Gradients, Bayesian Inference, and Occam s Razor https://people.orie.cornell.edu/andrew/orie6741 Cornell University August
More informationBayesian methods in economics and finance
1/26 Bayesian methods in economics and finance Linear regression: Bayesian model selection and sparsity priors Linear Regression 2/26 Linear regression Model for relationship between (several) independent
More informationOverview c 1 What is? 2 Definition Outlines 3 Examples of 4 Related Fields Overview Linear Regression Linear Classification Neural Networks Kernel Met
c Outlines Statistical Group and College of Engineering and Computer Science Overview Linear Regression Linear Classification Neural Networks Kernel Methods and SVM Mixture Models and EM Resources More
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationMachine learning - HT Basis Expansion, Regularization, Validation
Machine learning - HT 016 4. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford Feburary 03, 016 Outline Introduce basis function to go beyond linear regression Understanding
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction
More informationy(x) = x w + ε(x), (1)
Linear regression We are ready to consider our first machine-learning problem: linear regression. Suppose that e are interested in the values of a function y(x): R d R, here x is a d-dimensional vector-valued
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationToday. Calculus. Linear Regression. Lagrange Multipliers
Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More information