Lecture 7: Linear Classification Methods


Lecture 7: Linear Classification Methods

Final projects? Groups, topics, proposal week 5. Lecture is a poster session (Jacobs Hall lobby, snacks). Final report 5 June.

What is linear classification?

Classification is intrinsically non-linear: it puts non-identical things in the same class, so a difference in the input vector sometimes causes zero change in the answer. Linear classification means that the part that adapts is linear. The adaptive part is followed by a fixed non-linearity, and it may also be preceded by a fixed non-linearity (e.g. non-linear basis functions):

$z = \mathbf{w}^T\mathbf{x} + w_0$ (adaptive linear function), $y = f(z)$ (fixed non-linear function), with the decision made by, e.g., thresholding $y$ at 0.5.
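
A minimal sketch of this pipeline in NumPy, assuming a logistic output non-linearity; the basis function `phi`, the weights, and the test point are purely illustrative, not from the lecture:

```python
import numpy as np

def phi(x):
    # fixed non-linear basis: append the product x1*x2 (an arbitrary choice)
    return np.append(x, x[0] * x[1])

def predict(w, w0, x):
    z = w @ phi(x) + w0               # adaptive linear function
    y = 1.0 / (1.0 + np.exp(-z))      # fixed output non-linearity (logistic)
    return 1 if y > 0.5 else 0        # threshold decision at 0.5

print(predict(np.array([1.0, -2.0, 0.5]), 0.1, np.array([0.3, 0.7])))
```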

Representing the target values for classification

For two classes, we use a single real-valued output with target value 1 for the positive class and 0 (or -1) for the other class. For probabilistic class labels the target value can then be $P(t=1)$ and the model output can also represent $P(y=1)$. For N classes we often use a vector of N target values containing a single 1 for the correct class and zeros elsewhere. For probabilistic labels we can then use a vector of class probabilities as the target vector.

Three approaches to classification

1. Use discriminant functions directly, without probabilities: convert the input vector into one or more real values so that a simple operation (like thresholding) can get the class. Choose the real values to maximize the usable information about the class label that is in the real value.
2. Infer conditional class probabilities: compute $p(\text{class} = C_k \mid \mathbf{x})$ for each class, then make a decision that minimizes some loss function.
3. Compare the probability of the input under separate, class-specific generative models. E.g. fit a multivariate Gaussian to the input vectors of each class and see which Gaussian makes a test data vector most probable. Is this the best bet?

Discriminant functions

The planar decision surface in data-space for the simple linear discriminant function:

$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 \geq 0$

[Figure: for a point $\mathbf{x}$ on the plane, $y(\mathbf{x}) = 0$; the signed distance of any $\mathbf{x}$ from the plane is $y(\mathbf{x}) / \lVert\mathbf{w}\rVert$.]
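
A small NumPy example of the discriminant value and the resulting signed distance; the plane and the point are made up:

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0
x = np.array([2.0, 1.0])
y = w @ x + w0                     # discriminant value: 3*2 + 4*1 - 5 = 5
dist = y / np.linalg.norm(w)       # signed distance: 5 / 5 = 1.0
print(y >= 0, dist)                # class decision and distance from plane
```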

Discriminant functions for N > 2 classes

One possibility is to use N two-way discriminant functions; each function discriminates one class from the rest. Another possibility is to use N(N-1)/2 two-way discriminant functions; each function discriminates between two particular classes. Both of these methods have problems: there can be more than one good answer (ambiguous regions), and two-way preferences need not be transitive!

A simple solution

Use N discriminant functions $y_k(\mathbf{x})$ and pick the max. This is guaranteed to give consistent and convex decision regions if the $y_k$ are linear: $y_k(\mathbf{x}_A) > y_j(\mathbf{x}_A)$ and $y_k(\mathbf{x}_B) > y_j(\mathbf{x}_B)$ imply, for positive $\alpha$ and $\beta$ with $\alpha + \beta = 1$,

$y_k(\alpha\mathbf{x}_A + \beta\mathbf{x}_B) > y_j(\alpha\mathbf{x}_A + \beta\mathbf{x}_B),$

so every point on the segment between $\mathbf{x}_A$ and $\mathbf{x}_B$ is also assigned to class $k$. The decision boundary between classes $k$ and $j$ is where $y_k(\mathbf{x}) = y_j(\mathbf{x})$.
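
A minimal NumPy sketch of pick-the-max classification; the weight rows and biases are arbitrary, not fitted to anything:

```python
import numpy as np

W = np.array([[1.0, 0.0],          # one row of weights per class
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.1, -0.2])     # one bias per class

def classify(x):
    return int(np.argmax(W @ x + b))   # class with the largest y_k(x)

print(classify(np.array([2.0, -0.5])))  # -> 0
```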

Maximum likelihood and least squares (from lecture 3)

Computing the gradient of the log likelihood and setting it to zero yields

$\sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^T \phi(\mathbf{x}_n) \right\} \phi(\mathbf{x}_n)^T = 0.$

Solving for $\mathbf{w}$:

$\mathbf{w}_{ML} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{t} = \Phi^{\dagger}\mathbf{t},$

where $\Phi^{\dagger} \equiv (\Phi^T \Phi)^{-1} \Phi^T$ is the Moore-Penrose pseudo-inverse.
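
A minimal NumPy sketch of the pseudo-inverse solution on synthetic regression data; `np.linalg.pinv` computes $\Phi^{\dagger}$:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))            # design matrix: 100 cases, 3 basis fns
t = Phi @ np.array([0.5, -1.0, 2.0]) + 0.01 * rng.normal(size=100)
w_ml = np.linalg.pinv(Phi) @ t             # w_ML = (Phi^T Phi)^-1 Phi^T t
print(w_ml)                                # close to [0.5, -1.0, 2.0]
```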

Least squares for classification

Each class $C_k$ is described by its own linear model, so that

$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}, \qquad k = 1,\dots,K. \quad (4.13)$

We can conveniently group these together using vector notation, so that

$\mathbf{y}(\mathbf{x}) = \tilde{W}^T\tilde{\mathbf{x}}. \quad (4.14)$

Consider a training set $\{\mathbf{x}_n, \mathbf{t}_n\}$, $n = 1,\dots,N$, and define the matrices $\tilde{X}$ (with rows $\tilde{\mathbf{x}}_n^T$) and $T$ (with rows $\mathbf{t}_n^T$). The least-squares solution is

$\tilde{W} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T T = \tilde{X}^{\dagger} T, \quad (4.16)$

and the prediction is

$\mathbf{y}(\mathbf{x}) = \tilde{W}^T\tilde{\mathbf{x}} = T^T(\tilde{X}^{\dagger})^T\tilde{\mathbf{x}}. \quad (4.17)$
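
A minimal NumPy sketch of (4.16) and (4.17) with one-hot targets; the three class means and the sample counts are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(m, 0.5, size=(30, 2)) for m in means])
labels = np.repeat(np.arange(3), 30)
T = np.eye(3)[labels]                      # one-hot target matrix (N x K)

Xt = np.hstack([np.ones((len(X), 1)), X])  # prepend 1 for the bias: x-tilde
W = np.linalg.pinv(Xt) @ T                 # W = X-tilde^dagger T   (4.16)
pred = np.argmax(Xt @ W, axis=1)           # pick the largest y_k(x) (4.17)
print((pred == labels).mean())             # training accuracy
```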

Using least squares for classification

It does not work as well as better methods, but it is easy: it reduces classification to least squares regression.

[Figure: decision boundaries found by logistic regression vs. least squares regression on the same data.]

PCA does not work well for discrimination: it finds the directions of largest variance, which need not be the directions that separate the classes.

Picture showing the advantage of Fisher's linear discriminant

When projected onto the line joining the class means, the classes are not well separated. Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.

Math of Fisher's linear discriminants

What linear transformation is best for discrimination? The projection onto the vector separating the class means seems sensible:

$\mathbf{w} \propto \mathbf{m}_2 - \mathbf{m}_1$

But we also want small variance within each class:

$s_1^2 = \sum_{n \in C_1} (y_n - m_1)^2, \qquad s_2^2 = \sum_{n \in C_2} (y_n - m_2)^2$

Fisher's objective function is between-class separation over within-class variance:

$J(\mathbf{w}) = \dfrac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$

More math of Fisher's linear discriminants

$J(\mathbf{w}) = \dfrac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \dfrac{\mathbf{w}^T S_B\, \mathbf{w}}{\mathbf{w}^T S_W\, \mathbf{w}}$

where

$S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$

$S_W = \sum_{n \in C_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$

Optimal solution: $\mathbf{w} \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$
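
A minimal NumPy sketch of the optimal direction $\mathbf{w} \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$; the class means and spreads are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], [2.0, 0.3], size=(50, 2))   # class 1 samples
X2 = rng.normal([2.0, 1.0], [2.0, 0.3], size=(50, 2))   # class 2 samples
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# within-class scatter matrix S_W
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(S_W, m2 - m1)                       # Fisher direction
print(w / np.linalg.norm(w))
```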

We have probabilistic classification!

Probabilistic generative models for discrimination (Bishop p. 196)

Use a generative model of the input vectors for each class, and see which model makes a test input vector most probable. The posterior probability of class $C_1$ is

$p(C_1 \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_1)\,p(C_1)}{p(\mathbf{x} \mid C_1)\,p(C_1) + p(\mathbf{x} \mid C_2)\,p(C_2)} = \dfrac{1}{1 + e^{-a}},$

where $a = \ln \dfrac{p(\mathbf{x} \mid C_1)\,p(C_1)}{p(\mathbf{x} \mid C_2)\,p(C_2)}$ is called the logit and is given by the log odds.

An example for continuous inputs

Assume the input vectors for each class are Gaussian, and all classes have the same covariance matrix:

$p(\mathbf{x} \mid C_k) = \dfrac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}$

(the leading factor is a normalizing constant; $\Sigma^{-1}$ is the inverse covariance matrix). For two classes, $C_1$ and $C_2$, the posterior is a logistic, $p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$, with

$\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$

$w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^T \Sigma^{-1} \boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^T \Sigma^{-1} \boldsymbol{\mu}_2 + \ln\dfrac{p(C_1)}{p(C_2)}$
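
A minimal NumPy sketch of this closed form; the means, the shared covariance, and the priors are made up:

```python
import numpy as np

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
p1, p2 = 0.6, 0.4                        # class priors

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)

x = np.array([0.2, 0.1])
posterior = 1.0 / (1.0 + np.exp(-(w @ x + w0)))   # p(C1 | x)
print(posterior)
```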


The role of the inverse covariance matrix

If the Gaussian is spherical we do not need to worry about the covariance matrix. So, start by transforming the data space to make the Gaussian spherical; this is called whitening the data. It pre-multiplies by the matrix square root of the inverse covariance matrix. In the transformed space, the weight vector is just the difference between the transformed means:

$\mathbf{x}_{\text{aff}} = \Sigma^{-1/2}\mathbf{x}$ and $\mathbf{w}_{\text{aff}} = \Sigma^{-1/2}\boldsymbol{\mu}_1 - \Sigma^{-1/2}\boldsymbol{\mu}_2$ give $\mathbf{w}_{\text{aff}}^T \mathbf{x}_{\text{aff}}$ the same value as $\mathbf{w}^T\mathbf{x}$ with $\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$.
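
A small NumPy check of this equivalence, computing $\Sigma^{-1/2}$ by eigendecomposition; the numbers are made up:

```python
import numpy as np

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
x = np.array([0.2, 0.1])

vals, vecs = np.linalg.eigh(Sigma)                    # Sigma is symmetric PD
S_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T    # Sigma^(-1/2)
w = np.linalg.inv(Sigma) @ (mu1 - mu2)
# w^T x in the original space equals w_aff^T x_aff in whitened space
print(w @ x, (S_inv_half @ (mu1 - mu2)) @ (S_inv_half @ x))
```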

The posterior when the covariance matrices are different for different classes (figure from Bishop)

The decision surface is planar when the covariance matrices are the same, and quadratic when they are not.

Bernoulli distribution

Random variable $x \in \{0, 1\}$; coin flipping: heads $= 1$, tails $= 0$.

Bernoulli distribution: $\mathrm{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$

ML for Bernoulli: given a data set $D = \{x_1, \dots, x_N\}$, the maximum likelihood estimate is $\mu_{ML} = \frac{1}{N}\sum_{n=1}^N x_n$.
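
A minimal NumPy sketch: the ML estimate is just the fraction of 1s in simulated flips:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.7, size=1000)   # 1000 coin flips with true mu = 0.7
mu_ml = x.mean()                      # mu_ML = (1/N) * sum(x_n)
print(mu_ml)                          # close to 0.7
```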

The logistic function

The output is a smooth function of the inputs and the weights:

$z = w_0 + \mathbf{x}^T\mathbf{w}, \qquad y = \dfrac{1}{1 + e^{-z}}$

$\dfrac{\partial z}{\partial w_i} = x_i, \qquad \dfrac{\partial z}{\partial x_i} = w_i, \qquad \dfrac{dy}{dz} = y(1 - y)$

It is odd to express the derivative in terms of $y$, but it makes the algebra convenient.
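
A quick NumPy check that $dy/dz = y(1 - y)$ agrees with a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.3, 1e-6
y = sigmoid(z)
print(y * (1 - y),                                    # analytic derivative
      (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps))  # numeric derivative
```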

Logistic regression (Bishop p. 205)

Observations: a training set $\{(\mathbf{x}_n, t_n)\}$ with $t_n \in \{0, 1\}$ and model outputs $y_n = \sigma(\mathbf{w}^T\mathbf{x}_n)$.

Likelihood: $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n}$

Log-likelihood: minimize the negative log-likelihood,

$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^N \{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \}$

Derivative: $\nabla E(\mathbf{w}) = \sum_{n=1}^N (y_n - t_n)\,\mathbf{x}_n$

Logistic regression (page 205)

When there are only two classes we can model the conditional probability of the positive class as

$p(C_1 \mid \mathbf{x}) = y = \sigma(\mathbf{w}^T\mathbf{x} + w_0), \qquad \sigma(z) = \dfrac{1}{1 + e^{-z}}$

If we use the right error function, something nice happens: the gradient of the logistic and the gradient of the error function cancel each other:

$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}), \qquad \nabla E(\mathbf{w}) = \sum_{n=1}^N (y_n - t_n)\,\mathbf{x}_n$

The natural error function for the logistic

Fitting a logistic model using maximum likelihood requires minimizing the negative log probability of the correct answer, summed over the training set:

$E = -\sum_{n=1}^N \ln p(t_n \mid y_n) = -\sum_{n=1}^N \{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \}$

The error derivative on training case $n$ is

$\dfrac{\partial E}{\partial y_n} = -\dfrac{t_n}{y_n} + \dfrac{1 - t_n}{1 - y_n},$

which is $-1/y_n$ if $t_n = 1$ and $1/(1 - y_n)$ if $t_n = 0$.

Using the chain rule to get the error derivatives

With $z = \mathbf{w}^T\mathbf{x} + w_0$ and $y = \sigma(z)$:

$\dfrac{\partial E}{\partial w_i} = \dfrac{\partial E}{\partial y}\,\dfrac{dy}{dz}\,\dfrac{\partial z}{\partial w_i} = (y - t)\,x_i$

The $y(1 - y)$ from $dy/dz$ exactly cancels the denominator of $\partial E / \partial y$.
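
A minimal NumPy sketch of batch gradient descent using the $(y_n - t_n)\,\mathbf{x}_n$ gradient; the data, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
t = np.repeat([0.0, 1.0], 50)
Xt = np.hstack([np.ones((100, 1)), X])      # absorb the bias w0 into w

w = np.zeros(3)
for _ in range(500):
    y = 1.0 / (1.0 + np.exp(-Xt @ w))       # y_n = sigma(w^T x_n)
    w -= 0.1 / len(t) * (Xt.T @ (y - t))    # grad E = sum_n (y_n - t_n) x_n

y = 1.0 / (1.0 + np.exp(-Xt @ w))
print(((y > 0.5) == (t == 1.0)).mean())     # training accuracy
```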

Softmax function

For the case of K > 2 classes, we have

$p(C_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_k)\,p(C_k)}{\sum_j p(\mathbf{x} \mid C_j)\,p(C_j)} = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)} \quad (4.62)$

where

$a_k = \ln p(\mathbf{x} \mid C_k)\,p(C_k). \quad (4.63)$

This is also known as the softmax function, as it represents a smoothed version of the max.

Cross-entropy, or softmax, error function for multi-class classification

The output units use a non-local non-linearity:

$y_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \dfrac{\partial y_i}{\partial z_i} = y_i(1 - y_i)$

The natural cost function is the negative log probability of the right answer:

$E = -\sum_j t_j \ln y_j, \qquad \dfrac{\partial E}{\partial z_i} = \sum_j \dfrac{\partial E}{\partial y_j}\,\dfrac{\partial y_j}{\partial z_i} = y_i - t_i$

The steepness of $E$ exactly balances the flatness of the softmax.

[Figure: output units $z_1, z_2, z_3 \to y_1, y_2, y_3$, compared against the target values.]
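
A small NumPy check that the softmax/cross-entropy gradient is $y - t$, against a finite difference; the logits and the one-hot target are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
t = np.array([0.0, 0.0, 1.0])         # one-hot target

def E(z):
    return -np.sum(t * np.log(softmax(z)))

grad = softmax(z) - t                 # analytic gradient: y - t
eps = 1e-6
num = np.array([(E(z + eps * np.eye(3)[i]) - E(z - eps * np.eye(3)[i]))
                / (2 * eps) for i in range(3)])
print(grad, num)                      # the two should agree
```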

A special case of softmax for two classes

$y_1 = \dfrac{e^{z_1}}{e^{z_1} + e^{z_2}} = \dfrac{1}{1 + e^{-(z_1 - z_2)}}$

So the logistic is just a special case that avoids using redundant parameters: adding the same constant to both $z_1$ and $z_2$ has no effect. The over-parameterization of the softmax is because the probabilities must add to 1.
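
A quick numeric check of this identity, with arbitrary logits:

```python
import numpy as np

z1, z2 = 1.3, -0.4
y1_softmax = np.exp(z1) / (np.exp(z1) + np.exp(z2))
y1_logistic = 1.0 / (1.0 + np.exp(-(z1 - z2)))
print(y1_softmax, y1_logistic)        # identical up to rounding
```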