FMA901F: Machine Learning Lecture 4: Linear Models for Classification. Cristian Sminchisescu


Linear Classification. Classification is intrinsically non-linear because of the training constraints that place non-identical inputs in the same class: differences in the input vector sometimes cause no change in the answer. Linear classification means that the adaptive part is linear, y(x) = w^T x + w_0 (adaptive linear parameters w, w_0). The adaptive part is cascaded with a fixed non-linearity for the decision, c = f(y(x)). It may also be preceded by a fixed non-linearity when non-linear basis functions are used.

Approach 1: Discriminant Functions. Use discriminant functions directly, and do not compute probabilities. Convert the input vector into one or more real values so that a simple process (thresholding, or a majority vote) can be applied to assign the input to a class. The real values should be chosen to maximize the useable information about the class label present in the real value. Given discriminant functions y_1(x), ..., y_K(x), classify x as class C_k iff y_k(x) > y_j(x) for all j ≠ k.

Approach 2: Class-conditional Probabilities. Infer conditional class probabilities p(C_k | x). Use the conditional distribution to make optimal decisions, e.g. by minimizing some loss function. Example, 2 classes: p(C_1 | x) = σ(w^T x + w_0), where σ(a) = 1 / (1 + exp(-a)).

Approach 3: Class Generative Model. Compare the probability of the input under separate, class-specific, generative models. Model both the class-conditional densities p(x | C_k) and the prior class probabilities p(C_k). Compute the posterior using Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x), where p(x | C_k) is the class-conditional density, p(C_k) the class prior, and p(C_k | x) the posterior for class k. Example: fit a multivariate Gaussian to the input vectors corresponding to each class, model class prior probabilities by training-data frequency counts, and see which Gaussian makes a test data vector most probable using Bayes' theorem.
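
As a concrete illustration of Approach 3, here is a minimal sketch (not from the slides; all names and the synthetic data are illustrative) that fits one multivariate Gaussian per class, estimates priors from class frequency counts, and computes the posterior with Bayes' theorem:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_generative(X, y):
    """Fit one Gaussian per class plus frequency-count priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),            # class mean
                     np.cov(Xc, rowvar=False),   # class covariance
                     len(Xc) / len(X))           # prior p(C_c) from frequency counts
    return params

def posterior(x, params):
    """p(C_k | x) via Bayes' theorem: likelihood * prior, normalized over classes."""
    joint = {c: multivariate_normal.pdf(x, mean=m, cov=S) * p
             for c, (m, S, p) in params.items()}
    Z = sum(joint.values())
    return {c: v / Z for c, v in joint.items()}

# tiny synthetic example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(posterior([2.5, 2.5], fit_gaussian_generative(X, y)))
```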

Different Types of Plots in the Course. Weight space: each axis corresponds to a weight; a point is a weight vector; dimensionality = #weights (+1 extra dimension for the loss). Data space: each axis corresponds to an input value; a point is a data vector; a decision surface is a plane; dimensionality = dimensionality of a data vector. Case space (used for the geometric interpretation of least squares, L3): each axis corresponds to a training case; dimensionality = #training cases.

2-class case: the decision surface in data space for the linear discriminant function y(x) = w^T x + w_0 is given by y(x) = 0. w is orthogonal to any vector which lies on the decision surface, and so controls the orientation of the decision surface; w_0 controls its displacement from the origin (the perpendicular distance is -w_0 / ||w||).

Represent Target Values: Binary vs. Multiclass. Two classes (N=2): typically use a single real-valued output that has target values of 1 for the positive class and 0 (or -1) for the negative class. For probabilistic class labels, the target can be the probability of the positive class and the output of the model can be the probability the model assigns to the positive class. For the multiclass case (N>2), we use a vector of N target values containing a single 1 for the correct class and zeros elsewhere. For probabilistic labels we can then use a vector of class probabilities as the target vector.

Discriminant Functions for Multiclass. One possibility is to use binary 1-vs-all discriminants: each function separates one class from the rest. Another possibility is to use binary 1-vs-1 discriminants: each function discriminates between two specific classes, and we have one discriminant for each class pair. Both methods have ambiguities.

Problems with Multi-class Discriminant Functions Constructed from Binary Classifiers (1 vs. all, 1 vs. 1). If we base our decision on binary classifiers, we can encounter ambiguous regions.

Simple Solution. Use discriminant functions y_1(x), ..., y_K(x) and take the max over their responses. Consider linear discriminants y_k(x) = w_k^T x + w_k0. The decision boundary between class k and class j is given by the hyperplane (w_k - w_j)^T x + (w_k0 - w_j0) = 0. In this linear case the decision regions are convex: take x_A, x_B in region R_k and a point x̂ = λ x_A + (1-λ) x_B with 0 ≤ λ ≤ 1. From the linearity of y_k, y_k(x̂) = λ y_k(x_A) + (1-λ) y_k(x_B). But y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B), hence y_k(x̂) > y_j(x̂), so x̂ also lies inside R_k, and R_k is convex.

Least Squares for Classification. This is not necessarily the right approach in principle, and it does not work as well as more advanced methods, but it is simple. It reduces classification to least squares regression, and we already know how to do regression: we can solve for the optimal weights using the normal equations (L3). We set the target to be the conditional probability of the class given the input. When there are more than two classes, we treat each class as a separate problem. The justification for using least squares is that it approximates the conditional expectation; for the binary coding scheme, this expectation is given by the vector of posterior probabilities. Unfortunately these are approximated rather poorly (e.g. values outside the range (0,1)), due to the limited flexibility of the model.

Least Squares Classification. Assume each class has its own linear model: y_k(x) = w_k^T x + w_k0. Then we can write y(x) = W̃^T x̃, where the kth column of W̃ is the (D+1)-dim vector w̃_k = (w_k0, w_k^T)^T and x̃ = (1, x^T)^T is the augmented input. Given training data {x_n, t_n}, n = 1, ..., N, the nth row of T is t_n^T and the nth row of X̃ is x̃_n^T. The sum-of-squares error function for classification is E_D(W̃) = ½ Tr{ (X̃ W̃ - T)^T (X̃ W̃ - T) }. Setting the derivative to zero gives the closed-form solution W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T, where X̃† is the pseudoinverse of X̃. Property: if every target vector in the training set satisfies some linear constraint a^T t_n + b = 0 for some constants a, b, then the model prediction for any value of x satisfies the same constraint, a^T y(x) + b = 0.
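
A minimal NumPy sketch of the closed-form least-squares classifier described above, using one-hot targets and the pseudoinverse of the augmented design matrix; the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def least_squares_classifier(X, y, num_classes):
    """W_tilde = pinv(X_tilde) @ T, with 1-of-K (one-hot) target matrix T."""
    X_tilde = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias column
    T = np.eye(num_classes)[y]                        # one-hot coding of integer labels
    W_tilde = np.linalg.pinv(X_tilde) @ T             # closed-form least-squares solution
    return W_tilde

def predict(W_tilde, X):
    X_tilde = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)       # assign to the largest output
```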

Problems with using least squares for classification. [Figure: decision boundaries of logistic regression vs. least squares regression.] Least squares solutions lack robustness to outliers: if the right answer is 1 and the model says 1.5, it loses, so it changes the boundary to avoid being "too correct".

For non-Gaussian targets, least squares regression gives poor decision surfaces. [Figure: least squares vs. logistic regression.] Remember that least squares corresponds to maximum likelihood under a Gaussian conditional distribution; clearly the binary target vectors have a distribution that is far from Gaussian.

Fisher's Linear Discriminant. We can view classification in terms of dimensionality reduction. A simple linear discriminant function is a projection of the D-dimensional data down to 1 dimension. Project: y = w^T x; classify: if y ≥ -w_0 then class C_1, else class C_2. However, projection results in loss of information: classes well separated in the original input space may strongly overlap in 1-d. We will adjust the projection weight vector w to achieve the best separation among classes. But what do we mean by best separation?

Fisher's View of Class Separation I. The simplest measure of class separation, when projected onto w, is the separation of the projected class means. This suggests choosing w so as to maximize m_2 - m_1 = w^T(m_2 - m_1), where m_k is the projected mean of class k. This can be made arbitrarily large by increasing ||w||. We could handle this by imposing a unit-norm constraint using Lagrange multipliers: max_w w^T(m_2 - m_1), s.th. w^T w = 1. However, still, if the main direction of variance in each class is not orthogonal to the direction between the means, we will not get good separation (see next slide).

Advantage of using Fisher's Criterion. When projected onto the line joining the class means, the classes are not well separated. Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.

Fisher's View of Class Separation II. Fisher: maximize a function that gives a large separation between the projected class means, while also giving a small variance within each class, thereby minimizing class overlap. Choose the direction maximizing the ratio of between-class variance to within-class variance. This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions).

Fisher's Linear Discriminant. We seek a linear transformation that is best for discrimination: y = w^T x. The projection onto the vector separating the class means seems right: w ∝ m_2 - m_1. But we also want small variance within each class: s_k² = Σ_{n ∈ C_k} (y_n - m_k)². Fisher's objective function: J(w) = (m_2 - m_1)² / (s_1² + s_2²), i.e. between-class separation over within-class variance.

Fisher's Linear Discriminant: Derivations. Writing the objective in terms of w, J(w) = (m_2 - m_1)² / (s_1² + s_2²) = (w^T S_B w) / (w^T S_W w), where the between-class scatter is S_B = (m_2 - m_1)(m_2 - m_1)^T and the within-class scatter is S_W = Σ_{n ∈ C_1} (x_n - m_1)(x_n - m_1)^T + Σ_{n ∈ C_2} (x_n - m_2)(x_n - m_2)^T. Setting the derivative of J to zero, and noting that w^T S_B w and w^T S_W w are scalars, the optimal solution is w ∝ S_W^{-1}(m_2 - m_1). The above result is known as Fisher's linear discriminant. Strictly it is not a discriminant, but rather a direction of projection that can be used for classification in conjunction with a decision (e.g. thresholding) operation.
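
A short NumPy sketch (illustrative, two-class case) of the closed-form Fisher direction w ∝ S_W^{-1}(m_2 - m_1), together with a simple midpoint threshold for the decision step; the midpoint rule is an assumption, not part of the derivation above:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction: w proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: summed (not averaged) scatter of each class
    S_W = ((X1 - m1).T @ (X1 - m1)) + ((X2 - m2).T @ (X2 - m2))
    w = np.linalg.solve(S_W, m2 - m1)
    w /= np.linalg.norm(w)                    # scale is irrelevant for the direction
    threshold = 0.5 * (w @ m1 + w @ m2)       # simple midpoint decision threshold
    return w, threshold
```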

Fisher's Linear Discriminant: Computation. However, the objective is invariant to rescaling of w, so we can choose the denominator to be unity, w^T S_W w = 1. We can then minimize -½ w^T S_B w subject to this constraint. This corresponds to the primal Lagrangian L_P = -½ w^T S_B w + ½ λ (w^T S_W w - 1). From the KKT conditions, S_B w = λ S_W w: a generalized eigenvalue problem, as S_W^{-1} S_B is not symmetric.

Fisher's Linear Discriminant: Computation. Given that S_W is symmetric positive definite, we can write S_W = S_W^{1/2} S_W^{1/2}, where S_W^{1/2} = U D^{1/2} U^T from the eigendecomposition S_W = U D U^T. Defining v = S_W^{1/2} w, we get (S_W^{-1/2} S_B S_W^{-1/2}) v = λ v. We therefore have to solve a regular eigenvalue problem for a symmetric, positive semi-definite matrix S_W^{-1/2} S_B S_W^{-1/2}. We can find solutions λ_k and v_k, and the corresponding w_k = S_W^{-1/2} v_k. Which eigenvector and eigenvalue should we choose? The largest! Why? At a solution, w^T S_B w = λ w^T S_W w = λ, so up to a constant the objective reduces to λ, and we need to maximize over λ.
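
A sketch of the symmetrized route described above: whiten with S_W^{-1/2}, take the leading eigenvector of S_W^{-1/2} S_B S_W^{-1/2}, and map back; purely illustrative, assuming S_W is strictly positive definite:

```python
import numpy as np

def fisher_by_whitening(S_B, S_W):
    """Solve S_B w = lambda S_W w via the symmetric matrix S_W^{-1/2} S_B S_W^{-1/2}."""
    # S_W^{-1/2} from the eigendecomposition of the symmetric positive definite S_W
    d, U = np.linalg.eigh(S_W)
    S_W_inv_sqrt = U @ np.diag(1.0 / np.sqrt(d)) @ U.T
    M = S_W_inv_sqrt @ S_B @ S_W_inv_sqrt      # symmetric, positive semi-definite
    evals, evecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    v = evecs[:, -1]                           # pick the eigenvector of the largest eigenvalue
    w = S_W_inv_sqrt @ v                       # map back: w = S_W^{-1/2} v
    return w / np.linalg.norm(w)
```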

The Perceptron Model (ca. 1962). Linear discriminant model. The input vector x is first mapped using a fixed non-linear transformation to give a feature vector φ(x), then used to construct the linear model y(x) = f(w^T φ(x)), where f(a) = +1 if a ≥ 0, and -1 if a < 0. Typically we use target t = +1 for class C_1, t = -1 for C_2. The feature vector includes a bias component φ_0(x) = 1.

Perceptron Criteria I. The perceptron's algorithm can be motivated by error function minimization. A natural error would be the number of misclassified patterns. However this does not lead to a simple learning algorithm, because the error is a piecewise constant function of w, with discontinuities whenever a change in w causes the decision boundary to move across one of the datapoints. Gradient methods cannot be immediately applied, as the gradient is zero almost everywhere.

Perceptron Criteria II. Patterns in class C_1 will have w^T φ(x_n) > 0; patterns in class C_2 will have w^T φ(x_n) < 0. With the target coding t_n ∈ {+1, -1}, we would hence like all patterns to satisfy w^T φ(x_n) t_n > 0. The perceptron associates zero error to correctly classified patterns, whereas for a misclassified pattern it tries to minimize the quantity -w^T φ(x_n) t_n.

Perceptron Criteria III. The perceptron criterion is given by E_P(w) = -Σ_{n ∈ M} w^T φ(x_n) t_n, where M is the set of misclassified examples. By applying stochastic gradient descent: w^(τ+1) = w^(τ) - η ∇E_P(w) = w^(τ) + η φ(x_n) t_n. Since the perceptron's decision function is invariant to the rescaling of w, we can set η = 1. As w changes, so will the set of misclassified patterns.

Algorithm. We cycle through the training patterns in turn. For each pattern we evaluate the perceptron function output. If the pattern is correctly classified, then the weight vector remains unchanged. If the pattern is incorrectly classified: for class C_1 we add the vector φ(x_n) to the current estimate of the weight vector; for class C_2 we subtract the vector φ(x_n) from the current estimate of the weight vector.
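
A minimal sketch of this update rule (targets coded ±1, learning rate fixed at 1, bias folded into the feature vector); the epoch cap is an illustrative assumption, since the loop only terminates on its own for separable data:

```python
import numpy as np

def perceptron(Phi, t, max_epochs=100):
    """Cycle through patterns; add/subtract phi_n whenever a pattern is misclassified."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):          # t_n in {+1, -1}
            if t_n * (w @ phi_n) <= 0:          # misclassified (or on the boundary)
                w += t_n * phi_n                # +phi for class C1, -phi for class C2
                errors += 1
        if errors == 0:                         # all patterns correct: converged
            break
    return w
```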

Weight and Data Space. Imagine a space in which each axis corresponds to a feature value or to the weight on that feature. A point in this space is a weight vector. Feature vectors are shown in blue, translated away from the origin to reduce clutter. Each training case defines a plane: on one side of the plane the output is wrong. To get all training cases right we need to find a point on the right side of all the planes. This feasible region (if it exists) is a cone with its tip at the origin. [Figure labels: a feature vector with correct answer = 1, good weights; a feature vector with correct answer = 0, bad weights; the origin.] (Slide from Hinton)

Perceptron's Convergence. The contribution to the error function from a misclassified pattern is reduced. However, this does not imply that contributions from other misclassified patterns will have been reduced; the perceptron rule is not guaranteed to reduce the total error function at each stage. Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data set is linearly separable. The weight vector is always adjusted by a bounded amount in a direction with which it has a non-positive dot product, and thus ||w|| can be bounded above by √k R, where k is the number of changes to w. But the projection of w onto a feasible direction can also be bounded below by k γ, because if there exists an (unknown) feasible w* with margin γ, then every change makes progress in this direction by a positive amount that depends only on the input vectors. Combining the two bounds shows that the number of updates to the weight vector is bounded by (R/γ)², where R is the maximum norm of an input vector.

Summary: Perceptron's Convergence. Perceptron's convergence theorem: if there exists an exact solution (the data is linearly separable), then the perceptron algorithm is guaranteed to find an exact solution in a finite number of steps. The number of steps could be very large, though. Until convergence we cannot distinguish between a non-separable problem and one that is just slow to converge. Even for linearly separable data, there may be many solutions, depending on the parameter initialization and the order in which datapoints are presented.

Perceptron at Work.

Other Issues with the Perceptron. Does not provide probabilistic outputs. Does not generalize readily to more than 2 classes. Is based on linear combinations of fixed basis functions.

What Perceptrons Cannot Learn. The adaptive part of a perceptron cannot even tell if two single-bit features have the same value! Same: (1,1) → 1; (0,0) → 1. Different: (1,0) → 0; (0,1) → 0. [Figure: data space.] The four feature-output pairs give four inequalities that are impossible to satisfy: w_1 + w_2 ≥ θ and 0 ≥ θ, but w_1 < θ and w_2 < θ. The positive and negative cases cannot be separated by a plane. (Slide from Hinton)

The Logistic Sigmoid (named for its S shape). y = σ(a) = 1 / (1 + e^{-a}), with a = w^T x = Σ_i w_i x_i. This is also called a squashing function because it maps the entire real axis into a finite interval, (0, 1). For classification, the output is then a smooth function of the inputs and the weights. Properties: σ(-a) = 1 - σ(a); dy/da = y(1 - y); the inverse is the logit function a = ln( y / (1 - y) ). [Figure: plot of σ(a), passing through y = 0.5 at a = 0.]

Probabilistic Generative Models. Use a class prior and a separate generative model of the input vectors for each class, and compute which model makes a test input vector most probable. The posterior probability of class C_1 is given by p(C_1 | x) = p(x | C_1) p(C_1) / ( p(x | C_1) p(C_1) + p(x | C_2) p(C_2) ) = 1 / (1 + e^{-a}) = σ(a) (the logistic sigmoid), where a = ln [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ]. a is called the logit and is given by the log odds.

Multiclass Model: Softmax. p(C_k | x) = exp(a_k) / Σ_j exp(a_j), where a_k = ln [ p(x | C_k) p(C_k) ]. This is known as the normalized exponential. It can be viewed as a multiclass generalization of the logistic sigmoid. It is also called a softmax function: it is a smoothed version of `max'; if a_k ≫ a_j for all j ≠ k, then p(C_k | x) ≈ 1 and p(C_j | x) ≈ 0.
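
A tiny sketch of the normalized exponential; the max-subtraction is a standard numerical-stability trick and an implementation detail, not part of the slide:

```python
import numpy as np

def softmax(a):
    """Normalized exponential over the last axis; subtracting the max avoids overflow."""
    a = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))   # a smoothed version of 'max'
```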

Gaussian Class Conditionals. Assume that the input vectors for each class are from a Gaussian distribution, and all classes have the same covariance matrix. The class conditionals are p(x | C_k) = (1/Z) exp{ -½ (x - μ_k)^T Σ^{-1} (x - μ_k) }, with Σ^{-1} the inverse covariance matrix and Z the normalizer. For two classes C_1 and C_2, the posterior turns out to be a logistic: p(C_1 | x) = σ(w^T x + w_0), with w = Σ^{-1}(μ_1 - μ_2) and w_0 = -½ μ_1^T Σ^{-1} μ_1 + ½ μ_2^T Σ^{-1} μ_2 + ln [ p(C_1) / p(C_2) ]. The quadratic terms cancel due to the common covariance.

Interpretation of Decision Boundaries. The quadratic terms cancel due to the common covariance, so the sigmoid takes a linear function of x as its argument (with w and w_0 as on the previous slide). The decision boundaries correspond to surfaces along which the posteriors are constant, so they will be given by linear functions of x; thus, decision boundaries are linear in input space. The prior probabilities enter only through the bias parameter w_0, so changes in the priors have the effect of making parallel shifts of the decision boundary (more generally, of the parallel contours of constant posterior probability).

A picture of the two Gaussian class-conditional models and the resulting posterior for the red class, p(C_1 | x). The logistic sigmoid in the right-hand plot is coloured using a proportion of red tone given by p(C_1 | x) and a proportion of blue tone given by 1 - p(C_1 | x).

Class posteriors when the covariance matrices are different for different classes. The decision surface is planar when the covariance matrices are the same, and quadratic when they are not.

Effect of using Basis Functions. [Figure: centers of the Gaussian basis functions with green iso-contours; the linear decision boundary (logistic regression) in feature space; and the decision boundary induced in input space.]

Probabilistic Discriminative Models: Logistic Regression. In our discussion of generative approaches, we saw that under general assumptions the class posterior for C_1 can be written as a logistic sigmoid acting on a linear function of the feature vector. In logistic regression, we use the functional form of the generalized linear model explicitly: p(C_1 | φ) = y(φ) = σ(w^T φ), where σ(a) = 1 / (1 + exp(-a)). Fewer adaptive parameters compared to the generative model: for an M-dimensional feature space, the discriminative model has M parameters, while the generative model has 2M parameters for the means + M(M+1)/2 for the shared(!) covariance, a quadratic versus linear number of parameters.

Maximum Likelihood for Logistic Regression. For a dataset {φ_n, t_n}, n = 1, ..., N, with t_n ∈ {0, 1} and y_n = p(C_1 | φ_n) = σ(w^T φ_n), the likelihood is p(t | w) = Π_n y_n^{t_n} (1 - y_n)^{1 - t_n}. The negative log-likelihood gives the cross-entropy error E(w) = -ln p(t | w) = -Σ_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }, with gradient ∇E(w) = Σ_n (y_n - t_n) φ_n, which has a similar form to the gradient of the sum-of-squares regression model.
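
A short sketch of this objective: the cross-entropy error, its gradient Φ^T(y - t), and plain gradient descent on it; the step size, iteration count, and the small epsilon guard are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_and_grad(w, Phi, t):
    """E(w) = -sum_n [t_n ln y_n + (1-t_n) ln(1-y_n)]; gradient = Phi^T (y - t)."""
    y = sigmoid(Phi @ w)
    eps = 1e-12                                    # guard against log(0)
    E = -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    grad = Phi.T @ (y - t)
    return E, grad

def fit_logistic_gd(Phi, t, lr=0.1, iters=500):
    """Minimize the cross-entropy error by simple gradient descent."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        _, g = cross_entropy_and_grad(w, Phi, t)
        w -= lr * g
    return w
```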

Iterative Reweighted Least Squares. The Newton-Raphson update is w^(new) = w^(old) - H^{-1} ∇E(w). For the logistic model, ∇E(w) = Φ^T (y - t) and H = Φ^T R Φ, where Φ is the N × M design matrix with nth row φ_n^T, and R is the diagonal weighting matrix with R_nn = y_n (1 - y_n) ∈ (0, 1). It follows that w^(new) = (Φ^T R Φ)^{-1} Φ^T R z, with z = Φ w^(old) - R^{-1}(y - t): normal equations with a non-constant weighting matrix.
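
A sketch of the Newton-Raphson / IRLS update just described: each iteration forms R and the effective targets z, then solves a weighted least-squares problem; illustrative only, with no safeguards against ill-conditioning when predictions saturate near 0 or 1:

```python
import numpy as np

def irls_logistic(Phi, t, iters=10):
    """w_new = (Phi^T R Phi)^{-1} Phi^T R z, with z = Phi w - R^{-1} (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        R = np.diag(y * (1 - y))                      # non-constant diagonal weighting matrix
        z = Phi @ w - np.linalg.solve(R, y - t)       # effective target values
        w = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z)
    return w
```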

Logistic Regression: Chain Rule for Error Derivatives. With a_n = w^T x_n and y_n = σ(a_n): p(t | w) = Π_n y_n^{t_n} (1 - y_n)^{1 - t_n}, so E = -Σ_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]. Then ∂E/∂y_n = -t_n / y_n + (1 - t_n) / (1 - y_n) = (y_n - t_n) / ( y_n (1 - y_n) ), dy_n/da_n = y_n (1 - y_n), hence ∂E/∂a_n = (∂E/∂y_n)(dy_n/da_n) = y_n - t_n, and ∂E/∂w = Σ_n (y_n - t_n) x_n.

Facts on IRLS. The weighting matrix R is not constant, but the Hessian is positive definite. This means that we have to iterate to find the solution, but the log-likelihood function is concave in w: we have a unique optimum. The nth component of z can be interpreted as an effective target value obtained by making a local linear approximation to the logistic sigmoid around the current operating point. The elements of the diagonal weighting matrix R can be interpreted as variances. We can interpret IRLS as the solution to a linearized problem in the space of the variable a = w^T φ (the sigmoid argument).

Readings: Bishop, Ch. 4, up to 4.3.4.