Computer Vision Group Prof. Daniel Cremers. 3. Regression


Prof. Daniel Cremers 3. Regression

Categories of Learning (Rep.) Learning: Unsupervised Learning clustering, density estimation; Supervised Learning learning from a training data set, inference on the test data; Reinforcement Learning no supervision, but a reward function. Discriminant Function no prob. formulation, learns a function from objects to labels. Discriminative Model estimates the posterior prob. for each class. Generative Model estimates the likelihoods and uses Bayes rule for the posterior. 2

Categories of Learning Learning Unsupervised Learning clustering, density estimation Supervised Learning learning from a training data set, inference on the test data Reinforcement Learning no supervision, but a reward function Discriminant Function no prob. formulation, learns a function from objects to labels. Discriminative Model estimates the posterior for each class Generative Model est. the likelihoods and use Bayes rule for the post. Computer Vision 3

Mathematical Formulation (Rep.) Suppose we are given a set of objects $\mathcal{X}$ and a set $\mathcal{C}$ of object categories (classes). In the learning task we search for a mapping $f: \mathcal{X} \to \mathcal{C}$ such that similar elements in $\mathcal{X}$ are mapped to similar elements in $\mathcal{C}$. Difference between regression and classification: in regression, $\mathcal{C}$ is continuous, in classification it is discrete. Regression learns a function, classification usually learns class labels. For now we will treat regression. 4

Basis Functions In principle, the elements of $\mathcal{X}$ can be anything (e.g. real numbers, graphs, 3D objects). To be able to treat these objects mathematically, we need functions $\phi_j$ that map from $\mathcal{X}$ to $\mathbb{R}$. We call these the basis functions. We can also interpret the basis functions as functions that extract features from the input data. Features reflect the properties of the objects (width, height, etc.). 5

Simple Example: Linear Regression Assume: the basis function is the identity, $\phi(x) = x$. Given: data points $(x_1, t_1), \ldots, (x_N, t_N)$. Goal: predict the value $t$ of a new example $x$. Parametric formulation: $y(x, \mathbf{w}) = w_0 + w_1 x$. [Figure: data points $(x_1, t_1), \ldots, (x_5, t_5)$ plotted over $x$] 6

Linear Regression To evaluate the function $y$, we need an error function, the Sum of Squared Errors: $E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^N \bigl(y(x_i, \mathbf{w}) - t_i\bigr)^2$. We search for parameters $\mathbf{w}$ such that $E(\mathbf{w})$ is minimal: $y(x_i, \mathbf{w}) = w_0 + w_1 x_i \;\Rightarrow\; \nabla_{\mathbf{w}}\, y(x_i, \mathbf{w}) = (1 \;\; x_i)$. Using vector notation $\mathbf{x}_i := (1 \;\; x_i)^T$ and $y(x_i, \mathbf{w}) = \mathbf{w}^T \mathbf{x}_i$, we get $\nabla E(\mathbf{w}) = \sum_{i=1}^N \mathbf{w}^T \mathbf{x}_i \mathbf{x}_i^T - \sum_{i=1}^N t_i \mathbf{x}_i^T = (0 \;\; 0) \;\Rightarrow\; \mathbf{w}^T \underbrace{\sum_{i=1}^N \mathbf{x}_i \mathbf{x}_i^T}_{=:A^T} = \underbrace{\sum_{i=1}^N t_i \mathbf{x}_i^T}_{=:\mathbf{b}^T}.$ 7
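A minimal NumPy sketch of this derivation (not from the slides; the data and names are illustrative): it stacks the vectors $\mathbf{x}_i = (1 \;\; x_i)^T$, forms $A$ and $\mathbf{b}$ as above, and solves $A\mathbf{w} = \mathbf{b}$.

```python
# Illustrative least-squares line fit via the normal equations A w = b.
import numpy as np

def fit_line(x, t):
    """Fit y(x, w) = w0 + w1*x by minimizing the sum of squared errors."""
    X = np.column_stack([np.ones_like(x), x])   # rows are x_i = (1, x_i)^T
    A = X.T @ X                                  # A = sum_i x_i x_i^T
    b = X.T @ t                                  # b = sum_i t_i x_i
    return np.linalg.solve(A, b)                 # w = A^{-1} b

# Hypothetical usage on noisy samples of the line t = 0.5 + 2 x:
x = np.linspace(0.0, 1.0, 20)
t = 0.5 + 2.0 * x + 0.1 * np.random.randn(20)
w = fit_line(x, t)    # w[0] close to 0.5, w[1] close to 2.0
```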

Polynomial Regression Now we have: $y(x, \mathbf{w}) = w_0 + w_1 x + \ldots + w_M x^M = \sum_{j=0}^M w_j x^j$. Given: data points $(x_1, t_1), \ldots, (x_N, t_N)$, where $N$ is the data set size and $M$ is the model complexity. 8

Polynomial Regression We define the vector of basis functions $\boldsymbol{\phi}(x) := (1, x, x^2, \ldots, x^M)^T$ and obtain: $E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^N \bigl(\mathbf{w}^T \boldsymbol{\phi}(x_i) - t_i\bigr)^2$, $\nabla E(\mathbf{w}) = \mathbf{w}^T \sum_{i=1}^N \boldsymbol{\phi}(x_i)\,\boldsymbol{\phi}(x_i)^T - \sum_{i=1}^N t_i\, \boldsymbol{\phi}(x_i)^T$, where each $\boldsymbol{\phi}(x_i)\,\boldsymbol{\phi}(x_i)^T$ is an outer product. 11

Polynomial Regression Thus, we have: $\nabla E(\mathbf{w}) = \mathbf{w}^T \Phi^T \Phi - \mathbf{t}^T \Phi = \mathbf{0}^T$, where $\Phi := (\boldsymbol{\phi}(x_1), \ldots, \boldsymbol{\phi}(x_N))^T$ and $\mathbf{t} := (t_1, \ldots, t_N)^T$. It follows: $\mathbf{w} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t} = \Phi^+\mathbf{t}$, where $\Phi^+$ denotes the pseudoinverse of $\Phi$. 12

Computing the Pseudoinverse Mathematically, a pseudoinverse $\Phi^+$ exists for every matrix $\Phi$. However: if $\Phi^T\Phi$ is (close to) singular, the direct solution $(\Phi^T\Phi)^{-1}\Phi^T\mathbf{t}$ is numerically unstable. Therefore, the Singular Value Decomposition (SVD) is used: $\Phi = UDV^T$, where the matrices $U$ and $V$ are orthogonal and $D$ is a diagonal matrix. Then: $\Phi^+ = VD^+U^T$, where $D^+$ contains the reciprocal of all non-zero elements of $D$. 13
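A small NumPy sketch (illustrative only, not from the slides) of a polynomial fit using this SVD-based pseudoinverse; np.linalg.pinv follows the same construction, so the two results should agree for well-conditioned data.

```python
# Illustrative polynomial least squares via the SVD pseudoinverse Phi^+ = V D^+ U^T.
import numpy as np

def design_matrix(x, M):
    """Rows are phi(x_i) = (1, x_i, x_i^2, ..., x_i^M)."""
    return np.vander(x, M + 1, increasing=True)

def pinv_svd(Phi, eps=1e-10):
    """Build Phi^+ from the SVD, taking reciprocals of the non-zero singular values."""
    U, d, Vt = np.linalg.svd(Phi, full_matrices=False)
    d_plus = np.where(d > eps, 1.0 / d, 0.0)
    return Vt.T @ np.diag(d_plus) @ U.T

x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
Phi = design_matrix(x, M=3)
w = pinv_svd(Phi) @ t                          # w minimizes the sum of squared errors
assert np.allclose(w, np.linalg.pinv(Phi) @ t)
```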

A Simple Example Polynomial basis functions: $\phi_j(x) = x^j$ 14

Varying the Sample Size 15

The Resulting Model Parameters [Figure: bar charts of the estimated parameter values for increasing model complexity; their magnitudes grow from below 1 to the order of $10^5$] 16

Other Basis Functions Other basis functions are possible: Gaussian basis function: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$, where $\mu_j$ is the mean value and $s$ the scale. Sigmoidal basis function: $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$. In both cases a set of mean values $\mu_j$ is required. These define the locations of the basis functions. 17
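A brief sketch (my own illustration of the formulas above, not part of the slides) of how Gaussian and sigmoidal feature matrices could be built; placing the means on a uniform grid is one common, but not mandatory, choice.

```python
# Illustrative Gaussian and sigmoidal basis-function features.
import numpy as np

def gaussian_features(x, mus, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per location mu_j."""
    return np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * s ** 2))

def sigmoid_features(x, mus, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mus[None, :]) / s))

x = np.linspace(0.0, 1.0, 50)
mus = np.linspace(0.0, 1.0, 9)          # locations of the basis functions
Phi = np.column_stack([np.ones_like(x), gaussian_features(x, mus, s=0.1)])
```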

Gaussian Basis Functions 18

Sigmoidal Basis Functions 19

Observations The higher the model complexity, the better the fit to the training data. If the model complexity is too high, all data points are explained well, but the resulting model oscillates strongly and cannot generalize well. This is called overfitting. By increasing the size of the data set (number of samples), we obtain a better fit of the model. More complex models tend to have larger parameters. Problem: How can we find a good model complexity for a given data set of fixed size? 20

Regularization We observed that complex models yield large parameters, leading to oscillation. Idea: minimize the error function and the magnitude of the parameters simultaneously. We do this by adding a regularization term: $\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^N \bigl(\mathbf{w}^T\boldsymbol{\phi}(x_i) - t_i\bigr)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$, where $\lambda$ controls the influence of the regularization. 21

Regularization As above, we set the derivative to zero and obtain: $\mathbf{w} = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T\mathbf{t}$. With regularization, we can fit a complex model to a small data set. However, the problem now is to find an appropriate regularization coefficient $\lambda$. 22
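A hedged sketch of this regularized solution in NumPy (names and data are illustrative; $\lambda = 0$ falls back to the unregularized pseudoinverse solution):

```python
# Illustrative regularized least squares: w = (lambda*I + Phi^T Phi)^{-1} Phi^T t.
import numpy as np

def fit_regularized(Phi, t, lam):
    n_basis = Phi.shape[1]                      # number of basis functions (M + 1)
    A = lam * np.eye(n_basis) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

# Hypothetical usage: a degree-9 polynomial on 10 points stays tame for lam > 0.
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
Phi = np.vander(x, 10, increasing=True)         # phi_j(x) = x^j, j = 0..9
w_reg = fit_regularized(Phi, t, lam=1e-3)
```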

Regularized Results 23

The Problem from a Different View Assume that $y$ is affected by Gaussian noise $\epsilon$: $t = y(x, \mathbf{w}) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Thus, we have $p(t \mid x, \mathbf{w}, \sigma) = \mathcal{N}(t;\, y(x, \mathbf{w}), \sigma^2)$. 24

Maximum Likelihood Estimation Aim: we want to find the $\mathbf{w}$ that maximizes $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma)$, the likelihood of the measured data given a model. Intuitively: find parameters $\mathbf{w}$ that maximize the probability of measuring the already measured data $\mathbf{t}$. This is Maximum Likelihood Estimation. We can think of this as fitting a model $\mathbf{w}$ to the data $\mathbf{t}$. Note: $\sigma$ is also part of the model and can be estimated. For now, we assume $\sigma$ is known. 25

Maximum Likelihood Estimation Given data points: $\mathbf{x} = (x_1, \ldots, x_N)$, $\mathbf{t} = (t_1, \ldots, t_N)$. Assumption: the points are drawn independently from $p$: $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma) = \prod_{i=1}^N \mathcal{N}(t_i;\, y(x_i, \mathbf{w}), \sigma^2)$, where $y(x_i, \mathbf{w}) = \mathbf{w}^T\boldsymbol{\phi}(x_i)$. Instead of maximizing $p$ we can also maximize its logarithm (monotonicity of the logarithm). 26

Maximum Likelihood Estimation $\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma) = -\frac{1}{2\sigma^2}\sum_{i=1}^N \bigl(y(x_i, \mathbf{w}) - t_i\bigr)^2 - \frac{N}{2}\ln(2\pi\sigma^2)$. The second term is constant for all $\mathbf{w}$; the first term is equal to $-\frac{1}{\sigma^2}E(\mathbf{w})$. Thus, the parameters that maximize the likelihood are exactly those that minimize the sum of squared errors. 27

Maximum Likelihood Estimation $\mathbf{w}_{ML} := \arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma) = \arg\min_{\mathbf{w}} E(\mathbf{w}) = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t}$. The ML solution is obtained using the pseudoinverse. 28

Maximum A-Posteriori Estimation So far, we searched for parameters $\mathbf{w}$ that maximize the data likelihood. Now, we assume a Gaussian prior $p(\mathbf{w} \mid \sigma_2)$ on the parameters. Using this, we can compute the posterior (Bayes rule): $\underbrace{p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \sigma_1, \sigma_2)}_{\text{posterior}} \propto \underbrace{p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma_1)}_{\text{likelihood}}\; \underbrace{p(\mathbf{w} \mid \sigma_2)}_{\text{prior}}$. Maximizing the posterior is called Maximum A-Posteriori Estimation (MAP). 29

Maximum A-Posteriori Estimation Strictly, Bayes rule gives $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \sigma_1, \sigma_2) = \frac{p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma_1)\, p(\mathbf{w} \mid \sigma_2)}{\int p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma_1)\, p(\mathbf{w} \mid \sigma_2)\, d\mathbf{w}}$, but the denominator is independent of $\mathbf{w}$ and we want to maximize $p$. 30

Maximum A-Posteriori Estimation Taking the negative logarithm of the posterior, MAP estimation is equal to regularized error minimization: the MAP estimate corresponds to a regularized error minimization with $\lambda = (\sigma_1 / \sigma_2)^2$. 31
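For completeness, a short derivation sketch (my own filling-in of the omitted slide algebra, assuming Gaussian noise with standard deviation $\sigma_1$ and a zero-mean Gaussian prior with standard deviation $\sigma_2$ on each weight, which matches the $\lambda$ given above):

$$-\ln p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2}\sum_{i=1}^N \bigl(\mathbf{w}^T\boldsymbol{\phi}(x_i) - t_i\bigr)^2 + \frac{1}{2\sigma_2^2}\|\mathbf{w}\|^2 + \text{const} = \frac{1}{\sigma_1^2}\left(E(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2\right) + \text{const}, \quad \lambda = \frac{\sigma_1^2}{\sigma_2^2}.$$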

Summary Regression is a method to find a mathematical model (function) for a given data set. Regression can be done by minimizing the sum of squared errors (SSE), i.e. the distances to the data. Maximum-likelihood estimation uses a probabilistic representation to fit a model to noisy data. Maximum-likelihood under Gaussian noise is equivalent to SSE regression. Maximum-a-posteriori (MAP) estimation assumes a (Gaussian) prior on the model parameters. MAP is solved by regularized regression. 32

Prof. Daniel Cremers Bayesian Linear Regression

Bayesian Linear Regression Using MAP, we can find optimal model parameters, but for practical applications two questions arise: What happens in the case of sequential data, i.e. when the data points are observed one after the other? Can we model the probability of measuring a new data point, given all old data points? This is called the predictive distribution: $p(\underbrace{t_{N+1}}_{\text{new target}} \mid \underbrace{x_{N+1}}_{\text{new data}}, \underbrace{\mathbf{t}}_{\text{old targets}}, \underbrace{\mathbf{x}}_{\text{old data}})$. 34

Sequential Data Given: prior mean and covariance, noise covariance. 1. Set the current prior (initially to the given prior mean and covariance). 2. Observe a data point. 3. Formulate the likelihood as a function of $\mathbf{w}$ (= a Gaussian whose mean and covariance are determined by the data point and the noise). 4. Multiply the likelihood with the prior and normalize (= again a Gaussian, with a new mean and covariance). 5. This results in a new prior. 6. Go back to 1. if there are still data points available. 35
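A minimal NumPy sketch of one such update step, under the assumptions of the example that follows (Gaussian prior, Gaussian noise, known noise variance); the function name and the numbers are illustrative only.

```python
# Hedged sketch of the sequential update for Bayesian linear regression:
# multiplying a Gaussian prior with a Gaussian likelihood keeps the result Gaussian.
import numpy as np

def sequential_update(m_prior, S_prior, phi_x, t, sigma2):
    """Update the Gaussian N(w; m_prior, S_prior) with one observation
    t ~ N(w^T phi_x, sigma2) and return the new mean and covariance."""
    S_prior_inv = np.linalg.inv(S_prior)
    S_new_inv = S_prior_inv + np.outer(phi_x, phi_x) / sigma2
    S_new = np.linalg.inv(S_new_inv)
    m_new = S_new @ (S_prior_inv @ m_prior + t * phi_x / sigma2)
    return m_new, S_new

# Hypothetical line-fitting usage: phi(x) = (1, x)^T, zero-mean isotropic prior.
m, S = np.zeros(2), 2.0 * np.eye(2)
for x, t in [(0.1, 0.2), (0.4, 0.5), (0.9, 1.1)]:
    m, S = sequential_update(m, S, np.array([1.0, x]), t, sigma2=0.04)
```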

A Simple Example Our aim is to fit a straight line to a set of data points. Assume we have: basis functions equal to the identity, $\phi(x) = x$; prior mean zero and a given prior covariance; a given noise variance; a ground-truth line with intercept $a_0$ and slope $a_1$, from which the data points are sampled with added noise. Thus: we want to recover $a_0$ and $a_1$ from the sequentially incoming data points. 36

Bayesian Line Fitting No data points observed. [Figure, from C.M. Bishop: the prior in Hough (parameter) space, and line examples drawn from the prior in data space] 37

Bayesian Line Fitting One data point observed. [Figure, from C.M. Bishop: the likelihood of the observed point and the updated prior in Hough space; the data space shows the ground truth and line examples drawn from the prior] 38

Bayesian Line Fitting Two data points observed. [Figure, from C.M. Bishop: likelihood and updated prior in Hough space; line examples drawn from the prior in data space] 39

Bayesian Line Fitting 20 data points observed. [Figure, from C.M. Bishop: likelihood and updated prior in Hough space; line examples drawn from the prior in data space] 40

The Predictive Distribution We obtain the predictive distribution by integrating over all possible model parameters: $p(t_{N+1} \mid x_{N+1}, \mathbf{x}, \mathbf{t}) = \int \underbrace{p(t_{N+1} \mid x_{N+1}, \mathbf{w})}_{\text{new data likelihood}}\; \underbrace{p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})}_{\text{old data posterior}}\, d\mathbf{w}$. As before, the posterior is proportional to the likelihood times the prior, but now we do not maximize it. The posterior can be computed analytically, as the prior is Gaussian: $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}) = \mathcal{N}(\mathbf{w};\, \mathbf{m}_N, S_N)$ with $S_N^{-1} = S_0^{-1} + \frac{1}{\sigma^2}\Phi^T\Phi$ and $\mathbf{m}_N = S_N\!\left(S_0^{-1}\mathbf{m}_0 + \frac{1}{\sigma^2}\Phi^T\mathbf{t}\right)$, where $\mathbf{m}_0$ is the prior mean and $S_0$ the prior covariance. 41
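Under these Gaussian assumptions the predictive distribution at a new input is again Gaussian, with mean $\mathbf{m}_N^T\boldsymbol{\phi}(x_{N+1})$ and variance $\sigma^2 + \boldsymbol{\phi}(x_{N+1})^T S_N\, \boldsymbol{\phi}(x_{N+1})$. A small illustrative sketch (the numbers and names are mine, not from the slides):

```python
# Illustrative predictive mean and variance for Bayesian linear regression,
# given a Gaussian posterior N(w; m_N, S_N) and noise variance sigma2.
import numpy as np

def predictive(phi_new, m_N, S_N, sigma2):
    """Return mean and variance of p(t_new | x_new, old data)."""
    mean = m_N @ phi_new
    var = sigma2 + phi_new @ S_N @ phi_new
    return mean, var

# Hypothetical usage with a line-fitting posterior and phi(x) = (1, x)^T:
m_N, S_N = np.array([0.1, 1.0]), 0.05 * np.eye(2)
mean, var = predictive(np.array([1.0, 0.5]), m_N, S_N, sigma2=0.04)
```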

The Predictive Distribution (2) Example: sinusoidal data, 9 Gaussian basis functions, 1 data point. [Figure, from C.M. Bishop: the predictive distribution and some samples from the posterior] 42

Predictive Distribution (3) Example: sinusoidal data, 9 Gaussian basis functions, 2 data points. [Figure, from C.M. Bishop: the predictive distribution and some samples from the posterior] 43

Predictive Distribution (4) Example: sinusoidal data, 9 Gaussian basis functions, 4 data points. [Figure, from C.M. Bishop: the predictive distribution and some samples from the posterior] 44

Predictive Distribution (5) Example: sinusoidal data, 9 Gaussian basis functions, 25 data points. [Figure, from C.M. Bishop: the predictive distribution and some samples from the posterior] 45

Summary A model that has been found using regression can be evaluated in different ways (e.g. loss function, cross-validation, leave-one-out, BIC). This can be used to adjust model parameters such as the regularization coefficient λ, e.g. using a validation data set. Bayesian Linear Regression operates on sequential data and provides the predictive distribution. When using Gaussian priors (and Gaussian noise), all computations can be done analytically, as all probabilities remain Gaussian. 46