COMP 551 Applied Machine Learning Lecture 4: Linear classification

COMP 551 Applied Machine Learning Lecture 4: Linear classification. Instructor: Joelle Pineau (jpineau@cs.mcgill.ca). Class web page: www.cs.mcgill.ca/~jpineau/comp551. Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor's written permission.

Today's Quiz. 1. What is meant by the term overfitting? What can cause overfitting? How can one avoid overfitting? 2. Which of the following increases the chances of overfitting (assuming everything else is held constant): a) Reducing the size of the training set. b) Increasing the size of the training set. c) Reducing the size of the test set. d) Increasing the size of the test set. e) Reducing the number of features. f) Increasing the number of features.

Evaluation. Use cross-validation for model selection. The training set is used to select a hypothesis f from a class of hypotheses F (e.g. regression of a given degree). The validation set is used to compare the best f from each hypothesis class across different classes (e.g. regressions of different degrees). It must be untouched during the process of looking for f within a class F. Test set: ideally, a separate set of (labeled) data is withheld to get a true estimate of the generalization error. (Often the validation set is called the test set, without distinction.)
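As a concrete illustration of this three-way split, here is a minimal sketch using scikit-learn on an invented 1-D dataset (the data, degrees, and split sizes are illustrative, not from the course): each polynomial degree plays the role of a hypothesis class F, the validation set picks the class, and the untouched test set gives the final generalization estimate.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Illustrative 1-D data (not the course's data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Three-way split: train (fit f within a class), validation (compare classes), test (final estimate).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_degree, best_score = None, -np.inf
for degree in [1, 2, 3, 4, 5]:                      # each degree is a hypothesis class F
    f = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    f.fit(X_train, y_train)                         # best f within F, from the training set
    score = f.score(X_val, y_val)                   # compare classes on the validation set
    if score > best_score:
        best_degree, best_score = degree, score

final = make_pipeline(PolynomialFeatures(best_degree), LinearRegression()).fit(X_train, y_train)
print("chosen degree:", best_degree, "held-out test R^2:", final.score(X_test, y_test))
```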

Validation vs Train error [From Hastie et al. textbook, FIGURE 2.11: Test and training error as a function of model complexity. Prediction error is plotted against model complexity (low to high), with curves for the training sample and the test sample; the low-complexity end is marked High Bias / Low Variance and the high-complexity end Low Bias / High Variance.]

Bias vs Variance. The Gauss-Markov Theorem says: the least-squares estimates of the parameters w have the smallest variance among all linear unbiased estimates. Insight: find a lower-variance solution, at the expense of some bias. E.g. include a penalty for model complexity in the error to reduce overfitting: Err(w) = Σ_{i=1:n} ( y_i - w^T x_i )^2 + λ · model_size, where λ is a hyper-parameter that controls the penalty size.

Ridge regression (aka L2-regularization). Constrains the weights by imposing a penalty on their size: ŵ_ridge = argmin_w { Σ_{i=1:n} ( y_i - w^T x_i )^2 + λ Σ_{j=0:m} w_j^2 }, where λ can be selected manually, or by cross-validation. A little algebra gives the solution: ŵ_ridge = (X^T X + λI)^{-1} X^T Y. The ridge solution is not equivariant under scaling of the data, so typically the inputs need to be normalized first. Ridge gives a smooth solution, effectively shrinking the weights, but drives few weights to 0.
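The closed-form solution above translates almost directly into code. A minimal numpy sketch (function and variable names are illustrative), assuming the inputs have already been normalized as noted:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: (X^T X + lambda I)^{-1} X^T y.

    Uses a linear solve rather than an explicit matrix inverse; for lam > 0
    the system is well-posed even when X^T X alone is singular.
    """
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
```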

Lasso regression (aka L1-regularization). Constrains the weights by penalizing the absolute value of their size: ŵ_lasso = argmin_w { Σ_{i=1:n} ( y_i - w^T x_i )^2 + λ Σ_{j=1:m} |w_j| }. Now the objective is non-linear in the output y, and there is no closed-form solution; a quadratic programming problem must be solved instead. This is more computationally expensive than ridge regression. Lasso effectively sets the weights of less relevant input features to zero.
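Since lasso has no closed form, in practice one typically relies on an existing solver. The sketch below, using scikit-learn's Ridge and Lasso on an invented synthetic dataset (the regularization strengths are illustrative), shows the qualitative difference claimed above: ridge shrinks all weights, while lasso drives the irrelevant ones exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0, 1.5] + [0.0] * 7)     # only the first 3 features matter
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                   # shrinks weights, rarely exactly to 0
lasso = Lasso(alpha=0.1).fit(X, y)                   # drives irrelevant weights exactly to 0
print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))
```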

Comparing Ridge and Lasso. [Figure: in the (w_1, w_2) plane, contours of equal regression error overlaid on contours of equal model complexity penalty, shown for the ridge (L2) regularization and for the lasso (L1) regularization.]

A quick look at evaluation functions. We call L(Y, f_w(x)) the loss function. Least-squares / mean squared error (MSE) loss: L(Y, f_w(X)) = Σ_{i=1:n} ( y_i - w^T x_i )^2. Other loss functions? Absolute error loss: L(Y, f_w(X)) = Σ_{i=1:n} | y_i - w^T x_i |. 0-1 loss (for classification): L(Y, f_w(X)) = Σ_{i=1:n} I( y_i ≠ f_w(x_i) ). Different loss functions make different assumptions. Squared error loss assumes the data can be approximated by a global linear model with Gaussian noise.
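For concreteness, a minimal sketch of the three losses just listed, summed over the dataset as in the slide's notation (function names are illustrative):

```python
import numpy as np

def squared_error_loss(y, y_hat):
    return np.sum((y - y_hat) ** 2)        # least-squares / MSE-style loss

def absolute_error_loss(y, y_hat):
    return np.sum(np.abs(y - y_hat))       # absolute error loss

def zero_one_loss(y, y_hat_class):
    return np.sum(y != y_hat_class)        # 0-1 loss: number of misclassifications
```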

Next: Linear models for classification. [Linear regression of a 0/1 response; FIGURE 2.1: A classification example in two dimensions. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. The line is the decision boundary defined by x^T β̂ = 0.5. The orange shaded region denotes the part of input space classified as ORANGE, while the blue region is classified as BLUE.]

Classification problems. Given a data set D = <x_i, y_i>, i=1:n, with discrete y_i, find a hypothesis which best fits the data. If y_i ∈ {0, 1} this is binary classification. If y_i can take more than two values, the problem is called multi-class classification.

Applications of classification. Text classification (spam filtering, news filtering, building web directories, etc.). Image classification (face detection, object recognition, etc.). Prediction of cancer recurrence. Financial forecasting. Many, many more!

Simple example. Given nucleus size, predict cancer recurrence. Univariate input: X = nucleus size. Binary output: Y = {NoRecurrence = 0; Recurrence = 1}. Try: minimize the least-squares error. [Histograms of no-recurrence count and recurrence count as a function of nucleus size.]

Predicting a class from linear regression. Here the red line is: Ŷ = X (X^T X)^{-1} X^T Y. How to get a binary output? 1. Threshold the output: { y <= t for NoRecurrence, y > t for Recurrence }. 2. Interpret the output as a probability: y = Pr(Recurrence). [Plot of the linear regression fit of recurrence against nucleus size.] Can we find a better model?
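A minimal sketch of option 1 above: fit least squares on the 0/1 labels, then threshold the prediction (the threshold t = 0.5 and the function names are illustrative):

```python
import numpy as np

def fit_least_squares(X, y):
    Xb = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)    # w = (X^T X)^{-1} X^T y

def classify_by_threshold(w, X, t=0.5):
    Xb = np.column_stack([np.ones(len(X)), X])
    return (Xb @ w > t).astype(int)                # 1 = Recurrence, 0 = NoRecurrence
```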

Modeling for binary classification. Two probabilistic approaches: 1. Discriminative learning: directly estimate P(y|x). 2. Generative learning: separately model P(x|y) and P(y); use Bayes rule to estimate P(y|x): P(y=1|x) = P(x|y=1) P(y=1) / P(x).

Probabilistic view of discriminative learning. Suppose we have 2 classes: y ∈ {0, 1}. What is the probability of a given input x having class y = 1? Consider Bayes rule: P(y=1|x) = P(x, y=1) / P(x) = P(x|y=1) P(y=1) / [ P(x|y=1) P(y=1) + P(x|y=0) P(y=0) ] = 1 / [ 1 + P(x|y=0) P(y=0) / (P(x|y=1) P(y=1)) ] = 1 / (1 + exp(-a)) = σ(a), where a = ln [ P(x|y=1) P(y=1) / (P(x|y=0) P(y=0)) ] = ln [ P(y=1|x) / P(y=0|x) ]. Here σ has a special form, called the logistic function (by Bayes rule, the P(x) on top and bottom cancels out), and a is the log-odds ratio of the data being class 1 vs. class 0.

Discriminative learning: logistic regression. Idea: directly model the log-odds with a linear function: a = ln [ P(x|y=1) P(y=1) / (P(x|y=0) P(y=0)) ] = w_0 + w_1 x_1 + ... + w_m x_m. The decision boundary is the set of points for which a = 0. The logistic function (= sigmoid curve): σ(w^T x) = 1 / (1 + e^{-w^T x}). How do we find the weights? We need an optimization function.
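A minimal sketch of the logistic model's prediction rule (assuming X already contains a column of ones so that w_0 is folded into w; function names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, X):
    return sigmoid(X @ w)                  # P(y = 1 | x) = sigma(w^T x)

def predict_label(w, X):
    return (X @ w > 0).astype(int)         # the decision boundary is the set w^T x = 0
```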

Fitting the weights. Recall: σ(w^T x_i) is the probability that y_i = 1 (given x_i), and 1 - σ(w^T x_i) is the probability that y_i = 0. For y ∈ {0, 1}, the likelihood function, Pr(x_1, y_1, ..., x_n, y_n | w), is: Π_{i=1:n} σ(w^T x_i)^{y_i} (1 - σ(w^T x_i))^{1-y_i} (samples are i.i.d.). Goal: minimize the negative log-likelihood (also called the cross-entropy error function): - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1 - σ(w^T x_i)) ].
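The cross-entropy error above can be written out directly; a minimal numpy sketch (the eps clipping is an added numerical safeguard, not part of the slide's formula):

```python
import numpy as np

def cross_entropy_error(w, X, y, eps=1e-12):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # sigma(w^T x_i) for each example
    p = np.clip(p, eps, 1.0 - eps)         # guard against log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```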

Gradient descent for logistic regression. Error fn: Err(w) = - Σ_{i=1:n} [ y_i log(σ(w^T x_i)) + (1-y_i) log(1 - σ(w^T x_i)) ]. Take the derivative, using d log(σ)/dσ = 1/σ, dσ/da = σ(1-σ) for a = w^T x, d(w^T x)/dw = x, and d(1-σ)/dσ = -1:

∂Err(w)/∂w = - Σ_{i=1:n} [ y_i (1/σ(w^T x_i)) (1-σ(w^T x_i)) σ(w^T x_i) x_i + (1-y_i) (1/(1-σ(w^T x_i))) (1-σ(w^T x_i)) σ(w^T x_i) (-1) x_i ]
= - Σ_{i=1:n} x_i ( y_i (1-σ(w^T x_i)) - (1-y_i) σ(w^T x_i) )
= - Σ_{i=1:n} x_i ( y_i - σ(w^T x_i) ).

Now apply iteratively: w_{k+1} = w_k + α_k Σ_{i=1:n} x_i ( y_i - σ(w_k^T x_i) ). Other iterative methods can also be applied, e.g. Newton's method, coordinate descent, L-BFGS, etc.
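Putting the gradient and the update rule together, a minimal batch gradient descent sketch (the fixed step size alpha and the iteration count are illustrative choices, simpler than a decaying α_k):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_regression_gd(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent on the cross-entropy error.

    X: (n, m) design matrix (include a column of ones for a bias term),
    y: (n,) array of 0/1 labels, alpha: fixed step size.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # w_{k+1} = w_k + alpha * sum_i x_i (y_i - sigma(w_k^T x_i))
        w = w + alpha * (X.T @ (y - sigmoid(X @ w)))
    return w
```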

Modeling for binary classification. Two probabilistic approaches: 1. Discriminative learning: directly estimate P(y|x). 2. Generative learning: separately model P(x|y) and P(y); use Bayes rule to estimate P(y|x): P(y=1|x) = P(x|y=1) P(y=1) / P(x).

What you should know. Basic definition of the linear classification problem. Derivation of logistic regression. Linear discriminant analysis: definition, decision boundary. Quadratic discriminant analysis: basic idea, decision boundary. LDA vs QDA pros/cons. Worth reading further: under some conditions, linear regression for classification and LDA are the same (Hastie et al., pp. 109-110); the relation between logistic regression and LDA (Hastie et al., section 4.4.5).

Final notes. You don't yet have a team for Project #1? => Use myCourses. You don't yet have a plan for Project #1? => Start planning! Feedback on tutorial 1?