Linear Regression and Discrimination

Kernel-based Learning Methods
Christian Igel, Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
http://www.neuroinformatik.rub.de
July 16, 2009

Outline
- Warmup: Gaussian distributions
- Linear Regression
- Sum-of-squares Error
- Solution Using Pseudo-Inverse
- Linear Discriminant Analysis

Recall: Gaussian Distribution
Gaussian distribution of a single real-valued variable with mean $\mu \in \mathbb{R}$ and variance $\sigma^2$:
$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\}$$
Multivariate Gaussian distribution of a $d$-dimensional real-valued random vector with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$:
$$N(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det\Sigma}} \exp\left\{ -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}$$
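
A minimal sketch (not part of the slides) that evaluates both densities with NumPy/SciPy; the values of mu and Sigma are made up for illustration:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# univariate: N(x | mu, sigma^2)
x, mu, sigma = 0.5, 0.0, 1.0
p_uni = norm.pdf(x, loc=mu, scale=sigma)  # scale is the standard deviation sigma

# multivariate: N(x | mu, Sigma) for d = 2
x_vec = np.array([0.5, -0.2])
mu_vec = np.zeros(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
p_multi = multivariate_normal.pdf(x_vec, mean=mu_vec, cov=Sigma)

# direct evaluation of the formula from the slide, for comparison
d = len(x_vec)
diff = x_vec - mu_vec
p_direct = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / np.sqrt(
    (2 * np.pi) ** d * np.linalg.det(Sigma))
print(p_uni, p_multi, p_direct)  # p_multi and p_direct agree
```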

Linear functions
$$f(x) = \langle x, w \rangle + b$$
[Figure: hyperplane $\langle x, w \rangle + b = 0$ with normal vector $w$ and distance $b/\lVert w \rVert$ from the origin, separating the half-spaces $\langle x, w \rangle + b > 0$ and $\langle x, w \rangle + b < 0$.]
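
A tiny sketch (assumed values for $w$, $b$, and the test points) that evaluates $f(x)$ and reports on which side of the hyperplane a point lies:

```python
import numpy as np

w = np.array([1.0, -2.0])
b = 0.5

def f(x):
    return x @ w + b  # <x, w> + b

for x in [np.array([2.0, 0.0]), np.array([0.0, 1.0])]:
    side = np.sign(f(x))             # +1 / -1 / 0: above, below, or on the hyperplane
    dist = f(x) / np.linalg.norm(w)  # signed distance to the hyperplane
    print(x, f(x), side, dist)
```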

Recall: Linear regression
Goal of regression: given an input, predict the corresponding output from $Y \subseteq \mathbb{R}^m$, $1 \le m < \infty$. Let's consider $m = 1$.
We consider a mapping/encoding/representation $\phi: X \to \mathbb{R}^d$, $\phi(x) = (\phi_1(x), \ldots, \phi_d(x))^T$. The $\phi_i$ are called basis functions.
In linear regression, the hypotheses are
$$H = \left\{ \sum_{i=1}^d w_i \phi_i(x) + b \;\middle|\; w \in \mathbb{R}^d,\, b \in \mathbb{R} \right\}.$$
For convenience, we redefine $\phi: X \to \mathbb{R}^{d+1}$, $\phi(x) = (\phi_1(x), \ldots, \phi_d(x), 1)^T$ and write $H = \{ w^T \phi(x) \mid w \in \mathbb{R}^{d+1} \}$.
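
A minimal sketch of such a feature map; the polynomial basis functions $\phi_i(x) = x^i$ are an assumption chosen for illustration, and the trailing 1 implements the appended bias component from the slide:

```python
import numpy as np

def phi(x, d=3):
    """Map a scalar input x to (phi_1(x), ..., phi_d(x), 1)^T."""
    return np.array([x ** i for i in range(1, d + 1)] + [1.0])

w = np.array([0.2, -0.1, 0.05, 1.0])  # w in R^{d+1}; the last entry plays the role of b

def h(x):
    return w @ phi(x)  # hypothesis h(x) = w^T phi(x)

print(h(2.0))
```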

Sum-of-squares error and Gaussian Noise I
Consider a deterministic target function $f: X \to \mathbb{R}$ with zero-mean additive Gaussian noise with variance $\sigma^2$; then we have
$$p(y \mid x) = N(y \mid f(x), \sigma^2).$$
Given $S = \{(x_1, y_1), \ldots, (x_l, y_l)\}$, the likelihood of a hypothesis $h(x) = w^T \phi(x)$ is given by
$$p\big((y_1, \ldots, y_l)^T \mid (x_1, \ldots, x_l), w, \sigma^2\big) := p(y \mid w) = \prod_{i=1}^l N\big(y_i \mid w^T \phi(x_i), \sigma^2\big).$$

Sum-of-squares error and Gaussian Noise II
For convenience, we consider the logarithm:
$$\ln p(y \mid w) = \ln \prod_{i=1}^l N\big(y_i \mid w^T \phi(x_i), \sigma^2\big) = \sum_{i=1}^l \ln N\big(y_i \mid w^T \phi(x_i), \sigma^2\big) = -\frac{l}{2} \ln \sigma^2 - \frac{l}{2} \ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^l \left\{ y_i - w^T \phi(x_i) \right\}^2$$
Thus, here maximizing the likelihood corresponds to minimizing the sum-of-squares loss.
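
A minimal sketch on synthetic data (the data-generating model and parameter values are made up): the Gaussian log-likelihood above differs from $-\frac{1}{2\sigma^2}$ times the sum of squares only by terms that do not depend on $w$, so the parameters with larger log-likelihood have smaller sum-of-squares error:

```python
import numpy as np

rng = np.random.default_rng(0)
l, sigma = 20, 0.3
x = rng.uniform(-1, 1, size=l)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=l)  # noisy linear target

def log_likelihood(w, b):
    resid = y - (w * x + b)
    return (-l / 2 * np.log(sigma ** 2)
            - l / 2 * np.log(2 * np.pi)
            - np.sum(resid ** 2) / (2 * sigma ** 2))

def sse(w, b):
    return np.sum((y - (w * x + b)) ** 2)

# the parameter pair with the larger log-likelihood has the smaller SSE
print(log_likelihood(2.0, 1.0), sse(2.0, 1.0))
print(log_likelihood(1.0, 0.0), sse(1.0, 0.0))
```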

Solution using pseudo-inverse I
The (halved) sum-of-squares error of the linear model has derivative
$$\nabla_w \frac{1}{2} \sum_{i=1}^l \left\{ y_i - w^T \phi(x_i) \right\}^2 = -\sum_{i=1}^l \left\{ y_i - w^T \phi(x_i) \right\} \phi(x_i)^T.$$
Setting it to zero yields
$$0^T = \sum_{i=1}^l y_i \phi(x_i)^T - w^T \sum_{i=1}^l \phi(x_i) \phi(x_i)^T.$$

Solution using pseudo-inverse II
$$\sum_{i=1}^l y_i \phi(x_i)^T - w^T \sum_{i=1}^l \phi(x_i) \phi(x_i)^T = y^T \Phi - w^T \Phi^T \Phi$$
with the design matrix
$$\Phi := \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_{d+1}(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_{d+1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x_l) & \phi_2(x_l) & \cdots & \phi_{d+1}(x_l) \end{pmatrix}.$$
From $0^T = y^T \Phi - w^T \Phi^T \Phi$ it follows that $w^T \Phi^T \Phi = y^T \Phi$, so the maximum likelihood estimate is
$$w = \left(\Phi^T \Phi\right)^{-1} \Phi^T y.$$
$\Phi^\dagger := \left(\Phi^T \Phi\right)^{-1} \Phi^T$ is called the Moore-Penrose pseudo-inverse.
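
A minimal sketch of this solution on synthetic data (the polynomial basis and the data-generating function are illustrative assumptions); `np.linalg.pinv` and `np.linalg.lstsq` are numerically more robust than forming the inverse explicitly, and all three agree on this well-conditioned example:

```python
import numpy as np

rng = np.random.default_rng(1)
l, d = 50, 3
X = rng.uniform(-1, 1, size=l)
y = 0.5 * X ** 2 - X + 0.3 + rng.normal(0, 0.1, size=l)

# design matrix Phi with rows (phi_1(x_i), ..., phi_d(x_i), 1)
Phi = np.column_stack([X ** i for i in range(1, d + 1)] + [np.ones(l)])

w_normal = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y   # textbook formula from the slide
w_pinv   = np.linalg.pinv(Phi) @ y                  # via the Moore-Penrose pseudo-inverse
w_lstsq  = np.linalg.lstsq(Phi, y, rcond=None)[0]   # dedicated least-squares solver

print(np.allclose(w_normal, w_pinv), np.allclose(w_pinv, w_lstsq))
```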

Decision based on class posteriors
Classification means assigning an input $x \in X$ to one class of a finite set of classes $Y = \{C_1, \ldots, C_m\}$, $2 \le m < \infty$.
One approach is to learn appropriate discriminant functions $\delta_k: X \to \mathbb{R}$, $1 \le k \le m$, and assign a pattern $x$ to class $\hat{y}$ by choosing $\hat{y} = \arg\max_k \delta_k(x)$.
If we know the class posteriors $\Pr(Y \mid X)$ we can perform optimal classification: a pattern $x$ is assigned to the class $C_k$ with maximum $\Pr(Y = C_k \mid X = x)$, i.e., $\hat{y} = \arg\max_k \Pr(Y = C_k \mid X = x)$.
$\Pr(Y = C_k \mid X = x)$ is proportional to the class-conditional density $p(X = x \mid Y = C_k)$ times the class prior $\Pr(Y = C_k)$:
$$\Pr(Y = C_k \mid X = x) = \frac{p(X = x \mid Y = C_k)\,\Pr(Y = C_k)}{p(X = x)}$$
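
A minimal sketch of this decision rule with made-up class priors and Gaussian class-conditional densities: the posteriors follow from Bayes' rule and the optimal class is the argmax:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.7, 0.3])                    # Pr(Y = C_1), Pr(Y = C_2), assumed values
class_conditionals = [norm(loc=0.0, scale=1.0),  # p(x | C_1)
                      norm(loc=2.0, scale=1.0)]  # p(x | C_2)

x = 1.2
joint = np.array([cc.pdf(x) for cc in class_conditionals]) * priors  # p(x | C_k) Pr(C_k)
posterior = joint / joint.sum()                  # divide by p(x) to normalize
print(posterior, np.argmax(posterior))           # optimal class: argmax_k Pr(C_k | x)
```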

Decision based on class probabilities
[Figure: joint densities $p(x, C_1)$ and $p(x, C_2)$ as functions of $x$, together with the decision regions $R_1$ and $R_2$ and the decision threshold $x_0$; from C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.]

Gaussian class-conditionals
Let's consider
$$\ln \Pr(Y = C_k \mid X = x) = \ln p(X = x \mid Y = C_k) + \ln \Pr(Y = C_k) + \text{const}.$$
For $X = \mathbb{R}^d$ and Gaussian class-conditionals we have $p(X = x \mid Y = C_k) = N(x \mid \mu_k, \Sigma_k)$ and
$$\ln N(x \mid \mu_k, \Sigma_k) = \ln\left( \frac{1}{\sqrt{(2\pi)^d \det\Sigma_k}} \exp\left\{ -\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) \right\} \right)$$
$$= -\frac{d}{2}\ln 2\pi - \frac{1}{2}\ln\det\Sigma_k - \frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)$$
$$= -\frac{d}{2}\ln 2\pi - \frac{1}{2}\ln\det\Sigma_k - \frac{1}{2}x^T \Sigma_k^{-1} x - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k$$

Linear Discriminant Analysis (LDA)
Let's assume that all class-conditionals have the same covariance matrix $\Sigma$:
$$\ln \Pr(Y = C_k \mid X = x) - \text{const} = \ln \Pr(Y = C_k) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln\det\Sigma - \frac{1}{2}x^T \Sigma^{-1} x - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$
Dropping the terms that do not depend on $k$ gives the discriminant functions
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \ln \Pr(Y = C_k).$$
We estimate
$$\hat{\Pr}(Y = C_k) = l_k / l, \qquad \hat{\mu}_k = \frac{1}{l_k} \sum_{(x,y) \in S_k} x, \qquad \hat{\Sigma} = \frac{1}{l - m} \sum_{k=1}^m \sum_{(x,y) \in S_k} (x - \hat{\mu}_k)(x - \hat{\mu}_k)^T$$
with $S_k = \{(x,y) \in S \mid y = C_k\}$ and $l_k = |S_k|$.
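
A minimal sketch of these estimates and discriminant functions on synthetic two-class Gaussian data (class means and sample sizes are assumptions for illustration); the pooled covariance uses the $1/(l-m)$ normalization from the slide:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
X = np.vstack([rng.multivariate_normal(m, np.eye(2), size=100) for m in mu_true])
y = np.repeat([0, 1], 100)

classes = np.unique(y)
l, m = len(y), len(classes)
priors = np.array([np.mean(y == k) for k in classes])        # Pr(Y = C_k) = l_k / l
means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k estimates
Sigma = sum(((X[y == k] - means[k]).T @ (X[y == k] - means[k]))
            for k in classes) / (l - m)                      # pooled covariance estimate
Sigma_inv = np.linalg.inv(Sigma)

def delta(x):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + ln Pr(Y = C_k)."""
    return np.array([x @ Sigma_inv @ means[k]
                     - 0.5 * means[k] @ Sigma_inv @ means[k]
                     + np.log(priors[k]) for k in classes])

x_new = np.array([1.5, 0.5])
print(np.argmax(delta(x_new)))  # predicted class index
```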

Effect of learning the covariance
[Figure: scatter plots of two-class Gaussian data illustrating the effect of the estimated covariance structure on the decision boundary; from C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.]

Linear and Quadratic Discriminant Analysis
In LDA the decision boundaries $\{x \mid \delta_i(x) = \delta_j(x)\}$ (i.e., the hypotheses) between two classes $i$ and $j$ are linear functions.
Modeling independent covariance matrices for the class-conditionals leads to quadratic discriminant analysis (QDA) with quadratic discriminant functions:
$$\delta_k(x) = -\frac{1}{2}\ln\det\Sigma_k - \frac{1}{2}x^T \Sigma_k^{-1} x - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k + \ln \Pr(Y = C_k)$$
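
A minimal sketch of the QDA discriminant (the synthetic data, class means, and covariances are made-up values): each class keeps its own covariance $\Sigma_k$, which is what makes the boundaries quadratic:

```python
import numpy as np

rng = np.random.default_rng(3)
covs_true = [np.eye(2), np.array([[2.0, 0.5], [0.5, 0.5]])]
X = np.vstack([rng.multivariate_normal(m, C, size=100)
               for m, C in zip([np.zeros(2), np.array([2.0, 1.0])], covs_true)])
y = np.repeat([0, 1], 100)

priors = [np.mean(y == k) for k in (0, 1)]
means = [X[y == k].mean(axis=0) for k in (0, 1)]
covs = [np.cov(X[y == k], rowvar=False) for k in (0, 1)]  # per-class Sigma_k

def delta_qda(x, k):
    Sk_inv = np.linalg.inv(covs[k])
    return (-0.5 * np.log(np.linalg.det(covs[k]))
            - 0.5 * x @ Sk_inv @ x
            - 0.5 * means[k] @ Sk_inv @ means[k]
            + x @ Sk_inv @ means[k]
            + np.log(priors[k]))

x_new = np.array([1.0, 0.5])
print(max((0, 1), key=lambda k: delta_qda(x_new, k)))  # predicted class
```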

Linear Regression and LDA
One could determine the $\delta_k$ by linear regression. If $l_1 = l_2$ and $Y = \{-1, 1\}$, LDA and linear regression arrive at the same decision boundary; if $l_1 \neq l_2$, the cut point of the decision boundary is shifted.
For more than two classes, using standard linear regression for classification is not advisable.
Both linear regression and LDA should be considered as baseline algorithms for regression and classification. In particular, LDA often gives very good results!
References:
T. Hastie, R. Tibshirani, & J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.