Linear Classification


Linear Classification. CS 534: Machine Learning [Spring 2017]. Slides adapted from Lee Cooper, Joydeep Ghosh, and Sham Kakade.

Review: Linear Regression

Regression. Given an input vector $x^\top = (x_1, x_2, \ldots, x_p)$, we want to predict the quantitative response $Y$. Linear regression form: $f(x) = \beta_0 + \sum_{i=1}^{p} x_i \beta_i$. Least squares problem: $\min_{\beta} (y - X\beta)^\top (y - X\beta) \;\Rightarrow\; \hat{\beta} = (X^\top X)^{-1} X^\top y$.
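
A minimal numpy sketch of the closed-form least squares solution above (the synthetic X and y are illustrative, not from the slides):

import numpy as np

# Synthetic data: n = 100 samples, p = 3 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Normal equations give beta_hat = (X^T X)^{-1} X^T y; lstsq solves the same
# least squares problem and is numerically preferred to forming the inverse
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [1.5, -2.0, 0.5]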

Feature Selection. Brute force is infeasible for a large number of features. Algorithms: best subset selection (impractical beyond ~40 features); stepwise selection (forward and backward).

Regularization. Add a penalty term on the model parameters to obtain a simpler model or reduce sensitivity to the training data; less prone to overfitting: $\min_{\beta} L(X, y) + \lambda \, \text{penalty}(\beta)$.
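
As one concrete instance of this penalized objective, here is a minimal sketch of ridge regression, assuming the standard choice $\text{penalty}(\beta) = \|\beta\|^2$ (not spelled out on this slide), which has a closed form:

import numpy as np

def ridge(X, y, lam):
    # min ||y - X beta||^2 + lam ||beta||^2  =>  (X^T X + lam I) beta = X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)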

Ridge & Lasso Regularization. [Figures 3.8 & 3.10 (Hastie et al.): coefficient profiles for ridge regression (plotted against df(λ)) and the lasso (plotted against the shrinkage factor s) on the prostate cancer predictors lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45.]

Thus far, regression: predict a continuous value given some inputs or features.

Linear Classification

Linear Classifiers: Spam Filtering (spam vs. not spam)

Linear Classifiers: Weather Prediction

Notation. Number of classes: $K$. A specific class: $k$. Set of classes: $\mathcal{G}$. Prior probability of class $k$: $\pi_k = \Pr(G = k)$, with $\sum_{j=1}^{K} \pi_j = 1$.

Bayes Decision Theory

Statistical Decision Theory Revisited. Natural rule of classification: $f(x) = \arg\max_{k=1,\ldots,K} \Pr(G = k \mid X = x)$. Application of Bayes' rule: $\Pr(G = k \mid X = x) = \frac{\Pr(X = x \mid G = k)\,\Pr(G = k)}{\Pr(X = x)}$. Since the denominator is the same across all classes: $f(x) = \arg\max_{k=1,\ldots,K} \Pr(X = x \mid G = k)\,\pi_k$.
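
A toy sketch of this rule for two 1-D Gaussian classes (the priors, means, and standard deviations below are made up for illustration):

import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                        # pi_k
means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def bayes_classify(x):
    # argmax_k Pr(X = x | G = k) * pi_k; the evidence Pr(X = x) is dropped
    scores = norm.pdf(x, loc=means, scale=sds) * priors
    return int(np.argmax(scores))

print(bayes_classify(1.3))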

Classification Evaluation

Misclassification Rate. For two classes, $\Pr(\text{mistake}) = \int_{R_1} \Pr(x, G_2)\,dx + \int_{R_2} \Pr(x, G_1)\,dx$, where $R_k$ is the region the classifier assigns to class $k$. [Figure: the optimal decision boundary minimizes this quantity; the red area marks the extra error incurred by a non-optimal boundary.]

Confusion Matrix & Metrics

Problems with Accuracy. Accuracy assumes equal cost for both types of error (FN = FP). Is 99% accuracy good? It depends on the problem and the domain; compare to the base rate (i.e., always predicting the predominant class).

Receiver Operating Characteristic Curve. AUC = area under the ROC curve. https://en.wikipedia.org/wiki/Receiver_operating_characteristic

ROC Curves. An ROC curve is always non-decreasing. Each point represents a different tradeoff (cost ratio) between FPs and FNs. Two non-intersecting curves mean one method dominates the other; two intersecting curves mean one method is better for some cost ratios and the other method is better for other cost ratios.

Area Under the ROC Curve (AUC). > 0.9: excellent prediction, but also potentially fishy; check for information leakage. ~0.8: good prediction. ~0.5: random prediction. < 0.5: something is wrong! AUC is more robust than accuracy in class-imbalanced situations.
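
A short sketch of computing an ROC curve and its AUC with scikit-learn (the labels and scores below are hypothetical):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # classifier confidence for class 1

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, scores))              # area under the ROC curve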

Discriminant Analysis

Bayes Classifier. A MAP classifier (maximum a posteriori). Outcome: a partitioning of the input space. The classifier is optimal: it statistically minimizes the error rate. Why not use the Bayes classifier all the time? (In practice, the true class-conditional densities are unknown and must be estimated.)

Discriminant Functions. Each class has a discriminant function $\delta_k(x)$. Classify according to the best discriminant: $\hat{G}(x) = \arg\max_{k=1,\ldots,K} \delta_k(x)$. Can be formulated in terms of probabilities: $\hat{G}(x) = \arg\max_{k=1,\ldots,K} \Pr(G = k \mid X = x)$.

Discriminant Analysis. Bayes' rule: $\Pr(G \mid X)\Pr(X) = \Pr(X \mid G)\Pr(G)$. Application of Bayes' theorem with class-conditional densities $f_k$: $\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\,\pi_\ell}$. Use the log-ratio for a two-class problem: $\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = \ell \mid X = x)} = \log \frac{f_k(x)}{f_\ell(x)} + \log \frac{\pi_k}{\pi_\ell}$.

Linear Regression Classifier. Each response category is coded as an indicator variable. Fit a linear regression model to each column of the response indicator matrix simultaneously. Compute the fitted output and classify according to the largest component. Serious problems (class masking) occur when the number of classes is greater than or equal to 3!

Linear Discriminant Analysis (LDA). Assume each class density is a multivariate Gaussian: $f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)\right)$. LDA assumes the classes have a common covariance matrix $\Sigma_k = \Sigma$. Discriminant function: $\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$.

LDA Decision Boundaries. [Figure 4.5 (Hastie et al.): true distributions with the same covariance and different means, alongside estimated boundaries.]

LDA vs. Linear Regression. [Figure 4.2 (Hastie et al.): decision boundaries from linear regression of indicators vs. linear discriminant analysis.]

Quadratic Discriminant Analysis (QDA). What if the covariances are not equal? Quadratic discriminant functions: $\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) + \log \pi_k$. Quadratic decision boundary; a covariance matrix must be estimated for each class.

LDA vs. QDA Decision Boundaries. [Figure 4.6 (Hastie et al.): LDA vs. QDA decision boundaries.]

Gaussian Parameter Values. In practice, the parameters of the multivariate normal distributions are unknown; estimate them from the training data. Prior: $\hat{\pi}_k = N_k / N$. Mean: $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$. Pooled covariance: $\hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top / (N - K)$.
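
Putting these estimates together with the LDA discriminant from the earlier slide, a minimal sketch (X is an (N, p) array, g holds integer labels 0..K-1; the function and variable names are my own):

import numpy as np

def fit_lda(X, g):
    N, p = X.shape
    K = g.max() + 1
    priors = np.array([(g == k).mean() for k in range(K)])        # pi_hat_k
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])  # mu_hat_k
    # Pooled within-class covariance with the (N - K) denominator
    Sigma = sum((X[g == k] - means[k]).T @ (X[g == k] - means[k])
                for k in range(K)) / (N - K)
    Sigma_inv = np.linalg.inv(Sigma)

    def delta(x):
        # Linear discriminant scores delta_k(x); predict argmax over k
        return np.array([x @ Sigma_inv @ means[k]
                         - 0.5 * means[k] @ Sigma_inv @ means[k]
                         + np.log(priors[k]) for k in range(K)])
    return delta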

Regularized Discriminant Analysis. A compromise between LDA and QDA: shrink the separate covariances of QDA towards the common covariance of LDA. Similar in spirit to ridge regression: $\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}$.

Example: Vowel Data. The experiment recorded 528 instances of spoken words. Words fall into 11 classes ("vowels"), with 10 features for each instance.

Regularized Discriminant Analysis. [Figure 4.7 (Hastie et al.): regularized discriminant analysis on the vowel data; train and test misclassification rates as a function of α.] The optimum for the test data occurs close to QDA (α near 1).

Reduced-rank LDA. What if we want to further reduce the dimension to L, where L < K - 1? Why? Visualization; regularization (some dimensions may not provide good separation between classes, just noise).

Fisher's Linear Discriminant. Find the projection that maximizes the ratio of between-class variance to within-class variance: $\frac{\text{between}}{\text{within}} = \frac{(a^\top (\mu_2 - \mu_1))^2}{a^\top (\Sigma_1 + \Sigma_2)\, a}$. [Figure 4.6 (Bishop).]
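
For the two-class case this criterion has a standard closed-form maximizer, $a = (\Sigma_1 + \Sigma_2)^{-1}(\mu_2 - \mu_1)$ up to scale; a small sketch under that assumption:

import numpy as np

def fisher_direction(X1, X2):
    # X1, X2: samples from the two classes, each of shape (n_k, p)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within-class scatter
    return np.linalg.solve(Sw, mu2 - mu1)  # project the data onto X @ a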

Why Fisher Makes Sense. The following information is taken into account: the spread of the class centroids (the direction joining the centroids separates the means), and the shape of the data as defined by the covariance (the projection with minimum overlap can be found).

Why Fisher Makes Sense: Graphically. [Figure 4.9 (Hastie et al.): projecting onto the direction that maximizes between-class spread alone vs. the discriminant direction.]

Vowel Data: 2-D Subspace. [Figure 4.4 (Hastie et al.): training data plotted in the first two canonical coordinates of linear discriminant analysis.]

Vowel Data: Reduced-rank LDA. [Figure 4.10 (Hastie et al.): training and test error rates as a function of the subspace dimension.]

Vowel Data: Reduced-rank LDA (2). [Figure 4.11 (Hastie et al.): classification in the reduced subspace, plotted in the first two canonical coordinates.]

Logistic Regression

Revisiting LDA for Binary Classes. LDA assumes the predictors are normally distributed. $\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = \ell \mid X = x)} = \log \frac{\pi_k}{\pi_\ell} - \tfrac{1}{2}(\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) + x^\top \Sigma^{-1} (\mu_k - \mu_\ell) = \beta_0 + \beta^\top x$. The log odds of class 1 vs. class 2 is a linear function of $x$. Why not estimate the coefficients directly?

Link Functions. How to combine regression and probability? Use regression to model the posterior via a link function: map from real values to [0, 1], and make the probabilities sum to 1.

Logistic Regression. Logistic function (or sigmoid): $f(z) = \frac{1}{1 + \exp(-z)}$. Apply the sigmoid to a linear function of the input features: $\Pr(G = 0 \mid X, \beta) = \frac{1}{1 + \exp(x^\top \beta)}$, $\Pr(G = 1 \mid X, \beta) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}$.
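
A direct transcription of the sigmoid and the class-1 posterior above (here x is assumed to be augmented with a leading 1 so that beta includes the intercept):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posterior_class1(x, beta):
    # Pr(G = 1 | X = x, beta) = exp(x^T beta) / (1 + exp(x^T beta)) = sigmoid(x^T beta)
    return sigmoid(x @ beta)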

Sigmoid Function. [Plot of $f(x) = \frac{1}{1 + \exp(-(w_0 + w_1 x))}$.]

Fitting Logistic Regression Models. No longer straightforward (not simple least squares). See the book for a discussion of the two-class case. Use optimization methods (Newton-Raphson). In practice, use a software library.

Optimization: Log Likelihood. Maximize the likelihood of your training data, assuming the class labels are conditionally independent: $L(\beta) = \prod_{i=1}^{n} \Pr(G = k \mid X = x_i)$, $\beta = \{\beta_0, \beta_1\}$. Log likelihood: $\ell(\beta) = \sum_{i=1}^{n} \log \Pr(G = k \mid X = x_i)$, with $\Pr(G = k \mid X = x_i) = p_k(x_i; \beta)$.

Optimization: Logistic Regression. Log likelihood for logistic regression: $\ell(\beta) = \sum_{i=1}^{n} \left( y_i \beta^\top x_i - \log(1 + \exp(\beta^\top x_i)) \right)$. Simple gradient ascent using the derivatives: $\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{n} x_i (y_i - p(x_i; \beta))$. The book illustrates Newton-Raphson, which uses 2nd-order information for better convergence.
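
A bare-bones gradient-ascent sketch of the update implied by this derivative (the learning rate and iteration count are arbitrary; real code would use Newton-Raphson or a library, as the slides suggest):

import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    # X: (n, p) with a leading column of 1s; y: (n,) array of 0/1 labels
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # p(x_i; beta)
        beta += lr * X.T @ (y - p) / len(y)    # ascend dl/dbeta = X^T (y - p)
    return beta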

Logistic Regression Coefficients. How to interpret the coefficients? Similar to the interpretation for linear regression. Increasing the ith predictor $x_i$ by 1 unit while keeping all other predictors fixed increases: the estimated log odds (class 1) by an additive factor $\beta_i$; the estimated odds (class 1) by a multiplicative factor $\exp(\beta_i)$. For example, $\beta_i = 0.7$ multiplies the odds by $\exp(0.7) \approx 2$.

Example: South African Heart Disease. Predict myocardial infarction (heart attack). Variables: sbp (systolic blood pressure), tobacco (tobacco use), ldl (cholesterol measure), famhist (family history of myocardial infarction), obesity, alcohol, age.

Example: South African Heart Disease. [Table 4.2 (Hastie et al.): logistic regression coefficients, standard errors, and Z scores.]

Example: South African Heart Disease. [Figure 4.12 (Hastie et al.): scatterplot matrix of the predictors sbp, tobacco, ldl, famhist, obesity, alcohol, and age.]

Example: South African Heart Disease. [Figure 4.13 (Hastie et al.): L1-regularized logistic regression coefficient paths $\beta_j(\lambda)$ for age, famhist, ldl, tobacco, sbp, alcohol, and obesity.]

Linear Separability & Logistic Regression. What happens when my data is completely separable? The weights go to infinity; there are an infinite number of MLEs. Use some form of regularization to avoid this scenario.

LDA vs. Logistic Regression. LDA estimates the Gaussian parameters and priors (easy!). Logistic regression estimates the coefficients directly by maximum likelihood (harder!). Both have linear decision boundaries, yet the boundaries differ. Why? LDA assumes a normal distribution within each class; logistic regression is more flexible, and more robust in situations with outliers or non-normal class-conditional densities.

Multiclass Logistic Regression. Extension to K classes: use K - 1 models, $\log \frac{\Pr(G = j \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{0j} + \beta_j^\top x$. Model the log odds of each class relative to a base class. Fit the coefficients jointly by maximum likelihood. Put them together to get the posteriors: $\Pr(G = i \mid x) = \frac{\exp(\beta_{0i} + \beta_i^\top x)}{1 + \sum_{j=1}^{K-1} \exp(\beta_{0j} + \beta_j^\top x)}$ for $i \ne K$, and $\Pr(G = K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(\beta_{0j} + \beta_j^\top x)}$.
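
A sketch of assembling those posteriors, with class K as the base (coefs stacks the $K-1$ rows $[\beta_{0j}, \beta_j^\top]$, and x is augmented with a leading 1; the function and variable names are mine):

import numpy as np

def multiclass_posteriors(x, coefs):
    logits = coefs @ x                   # log odds of classes 1..K-1 vs. base class K
    expl = np.exp(logits)
    denom = 1.0 + expl.sum()
    return np.append(expl, 1.0) / denom  # posteriors for classes 1..K-1, then class K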

Logistic Regression Properties. Advantages: the parameters have useful interpretations (the effect of a unit change in a feature is to increase the odds of a response multiplicatively by the factor $\exp(\beta_i)$); quite robust and well developed. Disadvantages: parametric (though it works for the entire exponential family of distributions); the solution is not closed form, but fitting is still reasonably fast.

Logistic Regression: Additional Comments. An example of a generalized linear model, with canonical link function = logit, corresponding to the Bernoulli distribution. For more information, see the short course by Heather Turner (http://statmath.wu.ac.at/courses/heather_turner/glmCourse_001.pdf). An old technique, but still very widely used. Serves as the output layer for neural networks.

Comparison on Vowel Recognition. [Table 4.1 (Hastie et al.): training and test error rates on the vowel data for linear regression, LDA, QDA, and logistic regression.]

Generative vs. Discriminative. Generative: separately model the class-conditional densities and priors (examples: LDA, QDA). Discriminative: try to obtain the class boundaries directly, through heuristics or by estimating the posterior probabilities (examples: decision trees, logistic regression).

Generative vs. Discriminative: an Analogy. The task is to determine the language someone is speaking. Generative: learn each language and determine which language the speech belongs to. Discriminative: determine the linguistic differences without learning any language.