Modeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop

A Type of Supervised Learning Problem

We want to model data $(x_1, t_1), \ldots, (x_N, t_N)$, where $x_i$ is a vector of $D$ inputs (predictors) for case $i$, and $t_i$ is the target (response) variable for case $i$, which is real-valued. We are trying to predict $t$ from $x$ for some future test case, but we are not trying to model the distribution of $x$.

Suppose also that we don't expect the best predictor for $t$ to be a linear function of $x$, so ordinary linear regression on the original variables won't work well. We need to allow for a non-linear function of $x$, but we don't have any theory that says what form this function should take. What to do?

An Example Problem

As an illustration, we can use the synthetic data set looked at last lecture: 50 points generated with $x$ uniform from $(0, 1)$ and $y$ set by the formula

$$y = \sin(1 + x^2) + \text{noise}$$

where the noise has a $N(0, 0.03^2)$ distribution.

[Plot: true function and data points]

The noise-free true function, $\sin(1 + x^2)$, is shown by the line. This is simpler than real machine learning problems, but lets us look at plots...
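To make the setup concrete, here is a minimal sketch (in Python with NumPy, not from the slides) of how such a data set could be generated; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)              # arbitrary seed, for reproducibility

N = 50
x = rng.uniform(0.0, 1.0, size=N)           # inputs drawn uniformly from (0, 1)
true_f = np.sin(1.0 + x**2)                 # noise-free true function
y = true_f + rng.normal(0.0, 0.03, size=N)  # add N(0, 0.03^2) noise
```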

Linear Basis Function Models

We earlier looked at fitting this data by least-squares linear regression, using not just $x$, but also $x^2$, $x^3$, etc., up to (say) $x^4$ as predictors. This is an example of a linear basis function model.

In general, we do linear regression of $t$ on $\phi_1(x), \phi_2(x), \ldots, \phi_{M-1}(x)$, where the $\phi_j$ are basis functions that we have selected to allow for a non-linear function of $x$. This gives the following model:

$$t = y(x, w) + \text{noise}, \qquad y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = w^T \phi(x)$$

where $w$ is the vector of all $M$ regression coefficients (including the intercept, $w_0$) and $\phi(x)$ is the vector of all basis function values at input $x$, including $\phi_0(x) = 1$ for the intercept.
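To make the notation concrete, here is a small sketch of the feature map $\phi$ and the prediction $w^T \phi(x)$ for the polynomial case; the helper names (poly_phi, predict) are illustrative, not from the text.

```python
import numpy as np

def poly_phi(x, M):
    """Basis function values [1, x, x^2, ..., x^(M-1)] for each input; shape (N, M)."""
    x = np.asarray(x)
    return np.stack([x**j for j in range(M)], axis=-1)

def predict(x, w):
    """Compute y(x, w) = w^T phi(x) for each input."""
    return poly_phi(x, len(w)) @ w
```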

Maximum Likelihood Estimation

Suppose we assume that the noise in the regression model is Gaussian (normal) with mean zero and some variance $\sigma^2$. (The text uses $\beta$ to denote $1/\sigma^2$.)

With this assumption, we can write down the likelihood function for the parameters $w$ and $\sigma$, which is the joint probability density of all the targets in the training set as a function of $w$ and $\sigma$:

$$L(w, \sigma) = P(t_1, \ldots, t_N \mid x_1, \ldots, x_N, w, \sigma) = \prod_{i=1}^{N} N(t_i \mid w^T \phi(x_i), \sigma^2)$$

where $N(t \mid \mu, \sigma^2)$ is the density for $t$ under a normal distribution with mean $\mu$ and variance $\sigma^2$.

The maximum likelihood estimates for the parameters are the values of $w$ and $\sigma$ that maximize this likelihood. Equivalently, they maximize the log likelihood, which, ignoring terms that don't depend on $w$ or $\sigma$, is

$$\log L(w, \sigma) = -N \log(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2$$
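A short sketch of evaluating this log likelihood numerically, reusing the poly_phi helper from the sketch above and dropping the same constant terms:

```python
import numpy as np

def log_likelihood(w, sigma, x, t):
    """Gaussian log likelihood, up to additive constants not involving w or sigma."""
    resid = t - poly_phi(x, len(w)) @ w      # residuals t_i - w^T phi(x_i)
    return -len(t) * np.log(sigma) - np.sum(resid**2) / (2 * sigma**2)
```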

Least Squares Estimation

From this log likelihood function, we see that regardless of what $\sigma$ might be, the maximum likelihood estimate of $w$ is the value that minimizes the sum of squared prediction errors over training cases.

Let's put the values of all basis functions in all training cases into a matrix $\Phi$, with $\Phi_{ij} = \phi_j(x_i)$. Also, put the target values for all training cases into a vector $t$. We can now write the sum of squared errors on training cases as

$$\|t - \Phi w\|^2 = (t - \Phi w)^T (t - \Phi w)$$

This is minimized for the value of $w$ where its gradient is zero, which is where

$$-2 \Phi^T (t - \Phi w) = 0$$

Solving this, the least squares estimate of $w$ is

$$\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T t$$

This assumes that $\Phi^T \Phi$ is non-singular, so that there is a unique solution. When $M$ is greater than $N$, this will not be the case!
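A minimal sketch of computing $\hat{w}$; solving the normal equations is numerically preferable to forming the explicit inverse (np.linalg.lstsq would be more robust still when $\Phi^T \Phi$ is close to singular):

```python
import numpy as np

def least_squares_fit(Phi, t):
    """Least squares (maximum likelihood) estimate of w, assuming Phi^T Phi is non-singular."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Example usage with the polynomial basis from the earlier sketch:
# Phi = poly_phi(x, M=5)
# w_hat = least_squares_fit(Phi, y)
```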

Results with Polynomial Basis Functions

Recall we looked before at least-squares fits of polynomial models with increasing order, which can be viewed as basis function models with $\phi_j(x) = x^j$.

[Plots: second-, fourth-, and sixth-order polynomial model fits]

The gray line is the true noise-free function. We see that a second-order fit is too simple, but a sixth-order fit is too complex, producing overfitting.
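Putting the earlier sketches together, fits of increasing order can be compared along these lines (assuming the x, y, poly_phi, and least_squares_fit names introduced above); note that the training error alone always favours the higher-order fit, which is exactly the overfitting issue seen in the plots:

```python
import numpy as np

# Compare polynomial fits of increasing order on the training data.
for order in (2, 4, 6):
    Phi = poly_phi(x, M=order + 1)           # columns 1, x, ..., x^order
    w_hat = least_squares_fit(Phi, y)
    sse = np.sum((y - Phi @ w_hat) ** 2)     # training error keeps decreasing with order
    print(f"order {order}: training SSE = {sse:.4f}")
```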

Gaussian Basis Functions

Polynomials are global basis functions, each affecting the prediction over the whole input space. Often, local basis functions are more appropriate. One possibility is to use functions proportional to Gaussian probability densities:

$$\phi_j(x) = \exp\left(-(x - \mu_j)^2 / 2s^2\right)$$

Here are these basis functions for $s = 0.1$, with the $\mu_j$ on a grid with spacing $s$:

[Plot: Gaussian basis functions, s = 0.1]
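A sketch of this basis in code; the grid of centres over the input range is an assumption, since the slides only state the spacing:

```python
import numpy as np

def gaussian_phi(x, centres, s):
    """Gaussian basis functions exp(-(x - mu_j)^2 / (2 s^2)), with a constant column prepended."""
    x = np.asarray(x)[:, None]                # shape (N, 1)
    mu = np.asarray(centres)[None, :]         # shape (1, J)
    phi = np.exp(-(x - mu)**2 / (2 * s**2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])   # phi_0(x) = 1 for the intercept

s = 0.1
centres = np.arange(0.0, 1.0 + s, s)          # assumed grid over (0, 1) with spacing s
```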

Results with Gaussian Basis Functions

Here are the results using Gaussian basis functions (plus $\phi_0(x) = 1$) on the example dataset, with varying width (and spacing) $s$:

[Plots: Gaussian basis function fits for s = 0.1, s = 0.5, and s = 2.5]

The estimated values for the $w_j$ are not what you might guess. For the middle model above ($s = 0.5$), they are as follows:

6856.5, -3544.1, -2473.7, -2859.8, -2637.7, -2861.5, -2468.0, -3558.4
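For completeness, a sketch of the fit with the Gaussian basis, reusing gaussian_phi and least_squares_fit from above. The centre grid is an assumption, and the exact coefficients depend on it and on the noise, so this will not reproduce the slide's numbers; it does illustrate the same phenomenon of huge, largely cancelling weights, which arises because wide, heavily overlapping basis functions make $\Phi^T \Phi$ nearly singular.

```python
import numpy as np

s = 0.5
centres = np.arange(-1.0, 2.0 + s, s)        # assumed centres; the slides do not specify them
Phi = gaussian_phi(x, centres, s)            # includes the constant column phi_0(x) = 1
w_hat = least_squares_fit(Phi, y)
print(np.round(w_hat, 1))                    # large positive/negative weights that nearly cancel
```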