Learning from Data: Regression


November 3, 2005 http://www.anc.ed.ac.uk/~amos/lfd/

Classification or Regression? Classification: want to learn a discrete target variable. Regression: want to learn a continuous target variable. Topics: linear regression, generalised linear models, and nonlinear regression. Most regression models can be turned into classification models by passing the output through a logistic function, as in logistic regression.

One Dimensional Data [figure: scatter plot of the one-dimensional example data]

Linear Regression Simple example: 1-dimensional linear regression. Suppose we have data of the form (x, y), and we believe the data should follow a straight line. However, we also believe the target values y are subject to measurement error, which we will assume to be Gaussian. We often use the term error measure for the negative log likelihood; hence the terms training error and test error. Remember: Gaussian noise results in a quadratic negative log likelihood.

Example Believe the data should have a straight line fit: y = a + bx, but that there is some measurement error for y: y = a + bx + η, where η is a Gaussian noise term. For training data {(x^µ, y^µ); µ = 1, ..., N} of size N, the training error is −Σ_µ log P(η = y^µ − bx^µ − a) = A Σ_µ (y^µ − bx^µ − a)^2 + B. A and B depend on the variance of the Gaussian, but do not actually matter in a minimisation problem: we get the same minimum whatever A and B are.
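
As a concrete illustration (a sketch, not part of the original lecture), the snippet below generates synthetic straight-line data with Gaussian noise and evaluates this sum-squared training error for a candidate (a, b); the values a_true, b_true, noise_std and N are arbitrary assumptions made for the example.

    import numpy as np

    # Sketch: generate 1-dimensional data from a "true" line y = a + b*x with
    # Gaussian measurement noise, then evaluate the sum-squared training error
    # for a candidate (a, b). a_true, b_true, noise_std and N are assumed values.
    rng = np.random.default_rng(0)
    a_true, b_true, noise_std, N = 0.5, 0.8, 0.2, 30

    x = rng.uniform(-2.0, 3.0, size=N)
    y = a_true + b_true * x + rng.normal(0.0, noise_std, size=N)

    def sum_squared_error(a, b, x, y):
        """Training error sum_mu (y^mu - b*x^mu - a)^2: the negative log
        likelihood under Gaussian noise, up to the constants A and B."""
        return np.sum((y - b * x - a) ** 2)

    print(sum_squared_error(a_true, b_true, x, y))  # roughly N * noise_std**2
    print(sum_squared_error(0.0, 0.0, x, y))        # a much worse candidate line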

Generated Data [figure: scatter plot of data generated from the straight-line model with Gaussian noise]

Multivariate Case Consider the case where we are interested in y = f(x) for D-dimensional x: y = a + b^T x. In fact if we set w = (a, b^T)^T and introduce φ = (1, x^T)^T, then we can write y = w^T φ for the new augmented variables. The training error (up to an additive and multiplicative constant) is then E(w) = Σ_{µ=1}^N (y^µ − w^T φ^µ)^2, where φ^µ = (1, (x^µ)^T)^T.
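
A minimal sketch of the augmented representation, assuming the slide's convention that the columns of Φ are the data points φ^µ; the toy X, y and the helper names make_design and training_error are illustrative choices, not from the lecture.

    import numpy as np

    # Sketch, following the convention that the columns of Phi are the
    # augmented data points phi^mu = (1, x^mu).
    def make_design(X):
        """X has shape (N, D); returns Phi with shape (D+1, N)."""
        N = X.shape[0]
        return np.vstack([np.ones((1, N)), X.T])

    def training_error(w, Phi, y):
        """E(w) = sum_mu (y^mu - w^T phi^mu)^2."""
        return np.sum((y - w @ Phi) ** 2)

    # Example with assumed toy numbers: N = 5 points in D = 2 dimensions.
    X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, -1.0], [0.5, 2.0], [-1.0, 0.0]])
    y = np.array([1.0, 1.5, 0.0, 2.5, -0.5])
    w = np.zeros(3)                      # (a, b_1, b_2)
    Phi = make_design(X)
    print(training_error(w, Phi, y))     # error of the all-zero weight vector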

Maximum Likelihood Solution Minimum training error equals maximum log-likelihood. Take derivatives of the training error: ∇_w E(w) = 2 Σ_{µ=1}^N φ^µ (w^T φ^µ − y^µ). Write Φ = (φ^1, φ^2, ..., φ^N) and y = (y^1, y^2, ..., y^N)^T. Then ∇_w E(w) = 2Φ(Φ^T w − y).

Maximum Likelihood Solution Setting the derivatives to zero to find the minimum gives ΦΦ^T w = Φy. This means the maximum likelihood w is given by w = (ΦΦ^T)^{-1} Φy. The term (ΦΦ^T)^{-1} Φ is called the pseudo-inverse.
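
A sketch of the pseudo-inverse solution under the same column convention for Φ; solving the normal equations with np.linalg.solve is a standard numerical choice here (np.linalg.pinv(Phi.T) @ y would give the same weights), and the synthetic data is assumed purely for illustration.

    import numpy as np

    # Sketch of the pseudo-inverse solution w = (Phi Phi^T)^{-1} Phi y, where the
    # columns of Phi are the augmented data points.
    def fit_ml(Phi, y):
        return np.linalg.solve(Phi @ Phi.T, Phi @ y)

    # Toy usage with assumed numbers: data generated from y = 0.5 + 0.8*x.
    rng = np.random.default_rng(1)
    x = rng.uniform(-2.0, 3.0, size=50)
    y = 0.5 + 0.8 * x + rng.normal(0.0, 0.2, size=50)
    Phi = np.vstack([np.ones_like(x), x])     # shape (2, N)
    w = fit_ml(Phi, y)
    print(w)                                  # approximately [0.5, 0.8]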

Generated Data [figure: the generated data with the fitted line] The black line is the maximum likelihood fit to the data.

Recap Error measure is the negative log likelihood. The Gaussian error measure is the sum-squared error (up to a multiplicative and additive constant). Write down the regression error term. Build the weight vector w and data vector φ. Take derivatives and set to zero to obtain the pseudo-inverse solution.

But... All this just used φ. We chose to put the x values in φ, but we could have put anything in there, including nonlinear transformations of the x values. In fact we can choose any useful form for φ so long as the model remains linear in w, so that the derivatives of the error are linear in w and can be set to zero analytically. We can even change the size of φ. We already have the maximum likelihood solution in the case of Gaussian noise: the pseudo-inverse solution. Models of this form are called generalised linear models or linear parameter models.

Example: polynomial fitting Model y = w_1 + w_2 x + w_3 x^2 + w_4 x^3. Set φ = (1, x, x^2, x^3)^T and w = (w_1, w_2, w_3, w_4)^T. Can immediately write down the ML solution: w = (ΦΦ^T)^{-1} Φy, where Φ and y are defined as before.
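
A sketch of the polynomial case (the generating curve and noise level are assumptions): only the basis changes, the pseudo-inverse computation is identical.

    import numpy as np

    # Illustrative sketch: fit the cubic model y = w1 + w2*x + w3*x^2 + w4*x^3
    # with the same pseudo-inverse solution; only the basis phi has changed.
    def cubic_design(x):
        """Columns are phi^mu = (1, x, x^2, x^3)^T for each data point."""
        return np.vstack([np.ones_like(x), x, x ** 2, x ** 3])

    rng = np.random.default_rng(2)
    x = rng.uniform(-2.0, 2.0, size=100)
    y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(0.0, 0.3, size=100)  # assumed truth

    Phi = cubic_design(x)                       # shape (4, N)
    w = np.linalg.solve(Phi @ Phi.T, Phi @ y)   # ML / least-squares weights
    print(w)                                    # roughly [1.0, -2.0, 0.0, 0.5]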

Higher dimensional outputs Suppose the target values are vectors y. Then we introduce a different weight vector w_i for each output component y_i, and do a separate regression independently for each component.
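
A sketch for vector targets (sizes and true weights assumed): stacking the outputs as rows of Y lets the same normal-equation solve produce every w_i at once.

    import numpy as np

    # Sketch: with vector targets, each output component gets its own weight
    # vector, but all of them share the same pseudo-inverse computation.
    rng = np.random.default_rng(3)
    N, K = 60, 2                                   # assumed sizes: N points, K outputs
    x = rng.uniform(-2.0, 3.0, size=N)
    Phi = np.vstack([np.ones_like(x), x])          # shared basis, shape (2, N)
    W_true = np.array([[0.5, 0.8],                 # assumed weights for output 1
                       [-1.0, 0.3]])               # assumed weights for output 2
    Y = W_true @ Phi + rng.normal(0.0, 0.1, size=(K, N))

    # Solve (Phi Phi^T) W^T = Phi Y^T: column k of the solution is w for output k.
    W = np.linalg.solve(Phi @ Phi.T, Phi @ Y.T).T
    print(W)                                       # close to W_true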

Radial Basis Models Set φ_i(x) = exp(−(1/2)(x − m_i)^2 / α^2). Need to position these basis functions at some prior chosen centres m_i and with a given width α. We will discuss how the centres and widths can also be considered as parameters in a future lecture. Finding the weights is the same as ever: the pseudo-inverse solution.
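
A sketch of radial basis regression; the centres, width α, and the sinusoidal target below are assumptions made for illustration.

    import numpy as np

    # Sketch: radial basis functions phi_i(x) = exp(-0.5*(x - m_i)^2 / alpha^2)
    # with hand-chosen centres m_i and width alpha, then the usual
    # pseudo-inverse solution for the weights.
    def rbf_design(x, centres, alpha):
        """Rows are basis functions (plus a bias row), columns are data points."""
        bumps = np.exp(-0.5 * (x[None, :] - centres[:, None]) ** 2 / alpha ** 2)
        return np.vstack([np.ones_like(x), bumps])

    rng = np.random.default_rng(4)
    x = rng.uniform(-2.0, 3.0, size=80)
    y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=80)   # assumed nonlinear target

    centres = np.linspace(-2.0, 3.0, 7)                   # 7 evenly spaced centres
    alpha = 0.7                                           # chosen width
    Phi = rbf_design(x, centres, alpha)                   # shape (8, N)
    w = np.linalg.solve(Phi @ Phi.T, Phi @ y)
    print(w.shape)                                        # (8,) fitted weights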

Dimensionality Issues How many radial basis bumps do we need? Suppose we only needed 3 for a 1-dimensional regression problem. Then we would need 3^D for a D-dimensional problem. This becomes large very fast (for D = 10 that is already 3^10 = 59049 basis functions): this is commonly called the curse of dimensionality.

Model comparison How do we compare different models? For example we could introduce 1, 2, ..., 4000 radial basis functions. The more parameters the model has, the better it will fit the training data. Models with huge numbers of parameters could fit the training data perfectly. Is this a problem?
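
A sketch of this comparison (dataset sizes, noise level and model sizes are all assumed): as the number of radial basis functions grows, the training error keeps falling while the error on held-out test data can start to rise; np.linalg.lstsq is used as a numerically stable way to compute the pseudo-inverse solution.

    import numpy as np

    # Sketch: training error falls as we add radial basis functions, while
    # error on held-out test data can eventually rise.
    def rbf_design(x, centres, alpha):
        bumps = np.exp(-0.5 * (x[None, :] - centres[:, None]) ** 2 / alpha ** 2)
        return np.vstack([np.ones_like(x), bumps])

    rng = np.random.default_rng(5)
    x_train = rng.uniform(-2.0, 3.0, size=25)
    y_train = np.sin(2.0 * x_train) + rng.normal(0.0, 0.2, size=25)
    x_test = rng.uniform(-2.0, 3.0, size=200)
    y_test = np.sin(2.0 * x_test) + rng.normal(0.0, 0.2, size=200)

    for n_basis in (2, 5, 10, 20):
        centres = np.linspace(-2.0, 3.0, n_basis)
        Phi_tr = rbf_design(x_train, centres, 0.5)
        Phi_te = rbf_design(x_test, centres, 0.5)
        w, *_ = np.linalg.lstsq(Phi_tr.T, y_train, rcond=None)   # least-squares weights
        train_err = np.mean((y_train - w @ Phi_tr) ** 2)
        test_err = np.mean((y_test - w @ Phi_te) ** 2)
        print(n_basis, round(train_err, 4), round(test_err, 4))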

Summary Lots of different models are linear in the parameters. For such regression models with Gaussian noise, the maximum likelihood solution is analytically calculable. The optimum value is given by the pseudo-inverse solution. Overfitting remains a concern for very flexible models.